4.3 Evaluation, Monitoring, and Cost Awareness

Key Takeaways

AI-901 scenario selection includes operational judgment: validate quality, safety, grounding, agent behavior, and cost before calling a prototype production-ready.
Microsoft Foundry evaluations can run against models, agents, datasets, or traces and can use built-in agent, quality, and safety evaluators.
Foundry observability combines evaluation, monitoring, and tracing through signals such as logs, traces, model outputs, latency, error rates, token consumption, and quality scores.
Cost awareness includes model tokens, service transactions, fine-tuned model hosting, evaluations, dependent services such as search, and Azure Monitor or alerting charges.
Use representative data, edge cases, human review, budgets, alerts, and Cost Management meter analysis to keep AI behavior and spending aligned with the intended use.

Last updated: June 2026

Production Readiness Is Part Of Service Selection

A service that works in a demo can still fail in production. AI-901 does not require deep site reliability engineering, but Microsoft Foundry objectives include building apps and agents, and official Foundry documentation emphasizes evaluations, monitoring, tracing, guardrails, and cost management. For exam scenarios, read for the missing operational step. Sometimes the right answer is to evaluate, monitor, set a guardrail, or inspect cost rather than to choose a different model.

What To Measure

Concern	Useful signal	Foundry or Azure control	Scenario clue
Factual quality	Groundedness, relevance, completeness, coherence	Foundry quality evaluators and RAG checks	The answer must match source documents
Safety	Hate or unfairness, violence, sexual content, self-harm, protected material	Safety evaluators, Content Safety, guardrails	Open user prompts or public output
Agent behavior	Tool selection, tool input accuracy, task completion, tool output use	Agent evaluators and traces	The agent calls APIs or takes actions
Operations	Latency, error rate, throughput, failures, prompt size	Monitoring dashboards and Application Insights	Users complain the app is slow or inconsistent
Cost	Token consumption, service meters, deployment charges, evaluation usage	Cost Management, budgets, pricing calculator	A prototype becomes expensive as traffic grows

Evaluation Targets

Foundry evaluations can target an agent, a model, a dataset of existing outputs, or traces from previous interactions. That flexibility matters on exam questions. Use model evaluation when comparing prompt or model output for a single-turn task. Use agent evaluation when tool use, intent resolution, task completion, or multi-step behavior matters. Use dataset evaluation when outputs have already been generated and you need to score them in bulk. Use trace evaluation when production or test interactions are already captured.

The official evaluation flow also asks you to choose a data source. Synthetic scenarios can help before launch, especially for agents. Existing conversations reveal real behavior after launch. Prepared CSV or JSONL datasets support repeatable benchmarking. For sensitive apps, include edge cases, not only clean success examples.

Monitoring And Tracing

Observability is the ability to understand and troubleshoot AI systems over their lifecycle. In Foundry, monitoring is integrated with Azure Monitor Application Insights, and tracing can show model calls, tool invocations, agent decisions, and dependencies. That is especially important for agents because a bad final answer might come from a wrong tool choice, a malformed tool input, a weak retrieved context, or an unsafe model output.

Monitoring is also a responsible AI control. If a helpdesk bot starts producing unsupported answers, a quality dashboard or groundedness trend can reveal drift. If a public app receives harmful prompts, safety metrics and content-filter events can show whether the policy is working. If latency rises, traces can show whether the bottleneck is retrieval, model inference, tool calls, or downstream services.

Cost-Aware Selection

Foundry and Foundry Tools do not produce one simple bill. Costs can come from model tokens, Azure Speech or Language transactions, Content Understanding analysis, Content Safety checks, Azure AI Search, storage, networking, Application Insights, and evaluations. Language, vision, and audio models may all use token-based metering, and input plus output both matter. Fine-tuned deployments can also create hosting cost while deployed.

Use the Azure pricing calculator before rollout, then use Cost Management to compare actual charges by resource and meter. Budgets and alerts help detect surprises, but they are not a substitute for understanding the architecture. If a team asks why a chatbot is expensive, inspect token usage, model choice, prompt length, retrieved context size, traffic volume, evaluation settings, and dependent resources.

Evaluation-To-Production Loop

Define the intended use, prohibited behavior, and success criteria.
Build representative test data with normal cases, edge cases, and abuse cases.
Run model or agent evaluations with quality, safety, and agent-specific evaluators.
Inspect failures, adjust prompts, tools, retrieval, guardrails, or service choice.
Deploy with monitoring, tracing, content safety, budgets, and human escalation.
Re-evaluate production samples on a schedule and tune thresholds from evidence.

For AI-901, remember the exam-level message: service selection is not finished when the model responds once. A production-minded answer measures quality, safety, cost, and behavior over time.

Test Your Knowledge

A team has a Foundry support agent that can look up orders and start refunds. Before production, the team wants to verify tool choice, grounded policy answers, and harmful-output handling. What should they do?

Run a Foundry evaluation against the agent using representative scenarios with agent, quality, and safety evaluators.

Increase temperature so the agent explores more possible refund policies.

Remove tracing because traces make multi-step behavior harder to understand.

Switch every tool call to image generation to make the answers more visual.

Test Your Knowledge

A prototype chatbot becomes more expensive after launch. Which investigation is most aligned with Foundry cost guidance?

Review token usage, model choice, service meters, dependent resources, and Cost Management views by resource or meter.

Assume all AI costs come only from the exam fee for AI-901.

Disable all monitoring because monitoring never has any cost.

Judge cost only by whether the HTTP response status was 200.

Up Next

Foundry Lab Sequence for AI-901

Chapter 5: Practice Labs, Common Traps, and Final Review

Microsoft Certified: Azure AI Fundamentals

Microsoft Certified: Azure AI Fundamentals (AI-901)

4.3 Evaluation, Monitoring, and Cost Awareness

Key Takeaways

Production Readiness Is Part Of Service Selection

What To Measure

Evaluation Targets

Monitoring And Tracing

Cost-Aware Selection

Evaluation-To-Production Loop

Microsoft Certified: Azure AI Fundamentals

1Chapter 1: AI-901 Format and Responsible AI

2Chapter 2: Microsoft Foundry, Models, and Agents

3Chapter 3: Azure AI Services, Vision, Language, and Extraction

4Chapter 4: AI-901 Scenario and Service Selection

5Chapter 5: Practice Labs, Common Traps, and Final Review

Microsoft Certified: Azure AI Fundamentals (AI-901)

4.3 Evaluation, Monitoring, and Cost Awareness

Key Takeaways

Production Readiness Is Part Of Service Selection

What To Measure

Evaluation Targets

Monitoring And Tracing

Cost-Aware Selection

Evaluation-To-Production Loop