4.3 Evaluation, Monitoring, and Cost Awareness
Key Takeaways
- AI-901 scenario selection includes operational judgment: validate quality, safety, grounding, agent behavior, and cost before calling a prototype production-ready.
- Microsoft Foundry evaluations can run against models, agents, datasets, or traces and can use built-in agent, quality, and safety evaluators.
- Foundry observability combines evaluation, monitoring, and tracing through signals such as logs, traces, model outputs, latency, error rates, token consumption, and quality scores.
- Cost awareness includes model tokens, service transactions, fine-tuned model hosting, evaluations, dependent services such as search, and Azure Monitor or alerting charges.
- Use representative data, edge cases, human review, budgets, alerts, and Cost Management meter analysis to keep AI behavior and spending aligned with the intended use.
Production Readiness Is Part Of Service Selection
A service that works in a demo can still fail in production. AI-901 does not require deep site reliability engineering, but Microsoft Foundry objectives include building apps and agents, and official Foundry documentation emphasizes evaluations, monitoring, tracing, guardrails, and cost management. For exam scenarios, read for the missing operational step. Sometimes the right answer is to evaluate, monitor, set a guardrail, or inspect cost rather than to choose a different model.
What To Measure
| Concern | Useful signal | Foundry or Azure control | Scenario clue |
|---|---|---|---|
| Factual quality | Groundedness, relevance, completeness, coherence | Foundry quality evaluators and RAG checks | The answer must match source documents |
| Safety | Hate or unfairness, violence, sexual content, self-harm, protected material | Safety evaluators, Content Safety, guardrails | Open user prompts or public output |
| Agent behavior | Tool selection, tool input accuracy, task completion, tool output use | Agent evaluators and traces | The agent calls APIs or takes actions |
| Operations | Latency, error rate, throughput, failures, prompt size | Monitoring dashboards and Application Insights | Users complain the app is slow or inconsistent |
| Cost | Token consumption, service meters, deployment charges, evaluation usage | Cost Management, budgets, pricing calculator | A prototype becomes expensive as traffic grows |
Evaluation Targets
Foundry evaluations can target an agent, a model, a dataset of existing outputs, or traces from previous interactions. That flexibility matters on exam questions. Use model evaluation when comparing prompt or model output for a single-turn task. Use agent evaluation when tool use, intent resolution, task completion, or multi-step behavior matters. Use dataset evaluation when outputs have already been generated and you need to score them in bulk. Use trace evaluation when production or test interactions are already captured.
The official evaluation flow also asks you to choose a data source. Synthetic scenarios can help before launch, especially for agents. Existing conversations reveal real behavior after launch. Prepared CSV or JSONL datasets support repeatable benchmarking. For sensitive apps, include edge cases, not only clean success examples.
Monitoring And Tracing
Observability is the ability to understand and troubleshoot AI systems over their lifecycle. In Foundry, monitoring is integrated with Azure Monitor Application Insights, and tracing can show model calls, tool invocations, agent decisions, and dependencies. That is especially important for agents because a bad final answer might come from a wrong tool choice, a malformed tool input, a weak retrieved context, or an unsafe model output.
Monitoring is also a responsible AI control. If a helpdesk bot starts producing unsupported answers, a quality dashboard or groundedness trend can reveal drift. If a public app receives harmful prompts, safety metrics and content-filter events can show whether the policy is working. If latency rises, traces can show whether the bottleneck is retrieval, model inference, tool calls, or downstream services.
Cost-Aware Selection
Foundry and Foundry Tools do not produce one simple bill. Costs can come from model tokens, Azure Speech or Language transactions, Content Understanding analysis, Content Safety checks, Azure AI Search, storage, networking, Application Insights, and evaluations. Language, vision, and audio models may all use token-based metering, and input plus output both matter. Fine-tuned deployments can also create hosting cost while deployed.
Use the Azure pricing calculator before rollout, then use Cost Management to compare actual charges by resource and meter. Budgets and alerts help detect surprises, but they are not a substitute for understanding the architecture. If a team asks why a chatbot is expensive, inspect token usage, model choice, prompt length, retrieved context size, traffic volume, evaluation settings, and dependent resources.
Evaluation-To-Production Loop
- Define the intended use, prohibited behavior, and success criteria.
- Build representative test data with normal cases, edge cases, and abuse cases.
- Run model or agent evaluations with quality, safety, and agent-specific evaluators.
- Inspect failures, adjust prompts, tools, retrieval, guardrails, or service choice.
- Deploy with monitoring, tracing, content safety, budgets, and human escalation.
- Re-evaluate production samples on a schedule and tune thresholds from evidence.
For AI-901, remember the exam-level message: service selection is not finished when the model responds once. A production-minded answer measures quality, safety, cost, and behavior over time.
A team has a Foundry support agent that can look up orders and start refunds. Before production, the team wants to verify tool choice, grounded policy answers, and harmful-output handling. What should they do?
A prototype chatbot becomes more expensive after launch. Which investigation is most aligned with Foundry cost guidance?