2.2 Prompts, RAG, and Evaluation

Key Takeaways

A prompt is the model input package, including user input, system instructions, retrieved context, examples, and formatting constraints where applicable.
System messages set high-level behavior, boundaries, tone, and output expectations, while user prompts provide the current task.
Retrieval-augmented generation grounds answers by retrieving relevant content, adding it to the prompt, and asking the model to generate from that context.
Indexes improve RAG by making private or changing content searchable with keyword, semantic, vector, or hybrid retrieval.
Foundry evaluations test model, agent, or dataset outputs for quality and safety, including RAG metrics such as retrieval quality, groundedness, relevance, and completeness.

Last updated: June 2026

Prompting Is Application Design

For AI-901, a prompt is not just the sentence a user types. It is the package of input sent to the model: system instructions, user input, retrieved context, examples, tool definitions, and output constraints. Good prompting makes the model's job explicit and makes the response easier to check.

A system message gives high-level direction. It can define the assistant's role, boundaries, style, refusal rules, citation expectations, and output format. A user prompt supplies the immediate task. Few-shot examples show the pattern the model should follow, while constraints can require JSON, a short answer, citations, or a specific tone.

Prompt Choices That Matter

Choice	Exam meaning	Better answer in a scenario
System message	Sets durable role and behavior for a chat experience	Use it for boundaries, policy, tone, and output format
Few-shot examples	Demonstrate an input-output pattern	Use when the output structure is hard to describe in one instruction
Temperature	Controls variation in output	Lower it for factual or repeatable answers; raise cautiously for creative work
Max tokens	Caps generated length	Use to control cost and avoid overly long responses
Grounding context	Adds source content to the prompt	Use when answers must reflect private or current data

The exam often rewards concrete prompts over vague prompts. A prompt that says "summarize this refund policy in three bullets and cite the policy section used" is stronger than "be helpful." Clear prompting does not guarantee correctness, but it reduces ambiguity and gives evaluation something specific to measure.

RAG: Retrieve, Augment, Generate

Retrieval-augmented generation (RAG) is the standard pattern when an app needs answers based on private, specialized, or frequently changing content. It does not change the model's weights. Instead, the application retrieves relevant content and inserts that content into the model input.

Use this flow:

Prepare the content. Organize documents, split them into useful chunks, and keep metadata such as title, URL, file name, or security labels.
Create or connect an index. Azure AI Search is a common index store. Retrieval can use keyword, semantic, vector, or hybrid search.
Retrieve for the user question. The app finds passages that are likely to answer the current request.
Augment the prompt. The app combines user input, system rules, and retrieved passages.
Generate and cite. The model produces an answer that should stay within the provided grounding data and cite the sources when required.

This is different from fine-tuning. RAG is for fresh or private knowledge. Fine-tuning is for adapting behavior, style, or repeated task patterns using training examples. If the question says company policy changes every month, RAG is usually better. If the question says the model must learn a stable response style from many examples, fine-tuning may be relevant.

RAG Limits And Security

RAG improves factual grounding, but it is not magic. Poor chunking, weak embeddings, bad search settings, or irrelevant retrieved passages can still produce unsupported answers. Retrieved documents can also contain sensitive information or malicious instructions. A responsible RAG app applies access control at retrieval time, treats retrieved text as untrusted input, and uses system instructions that tell the model how to handle conflicting or suspicious document content.

RAG also consumes tokens. Retrieved passages increase input length, cost, and latency. A good app ranks and filters passages instead of dumping too much context into the prompt.

Evaluation Closes The Loop

Foundry evaluations test generative AI models and agents against datasets, conversations, traces, or synthetic scenarios. For a model, evaluation can measure output quality, safety, and task fit. For an agent, it can evaluate full conversations or individual turns.

RAG has two evaluation layers:

Process evaluation checks retrieval. Are the returned chunks relevant? Did the index return the right documents? Should chunk size, top-k, semantic ranking, or vector settings change?
System evaluation checks the final answer. Is it grounded in the provided context? Does it answer the query? Is it complete against a ground truth answer?

AI-901 does not require you to calculate metrics by hand. It expects you to know why evaluation is necessary. A fluent answer can still be false, incomplete, unsafe, or unsupported. The better production habit is to test prompts, RAG retrieval, groundedness, relevance, and safety before deploying and then monitor real use for regressions.

Test Your Knowledge

A benefits chatbot must answer from the company's current HR handbook and include a source reference for each policy answer. The handbook changes several times a year. Which design choice is most appropriate?

Use RAG with an indexed handbook so retrieved passages ground each answer.

Rely only on the base model's training data because public models know all company policies.

Fine-tune once and never update the content again.

Increase temperature so the chatbot produces more varied policy answers.

Test Your Knowledge

A RAG prototype gives poor answers, and logs show the final prompt contains passages from unrelated documents. Which investigation should come first?

Evaluate and tune the retrieval step, including chunking, search mode, ranking, and top-k settings.

Increase the response length limit so the model has more room to guess.

Remove all source citations because citations make answers less reliable.

Replace every system message with an empty prompt.

Up Next

2.3 Foundry Agent Service

Continue learning

Microsoft Certified: Azure AI Fundamentals

Microsoft Certified: Azure AI Fundamentals (AI-901)

2.2 Prompts, RAG, and Evaluation

Key Takeaways

Prompting Is Application Design

Prompt Choices That Matter

RAG: Retrieve, Augment, Generate

RAG Limits And Security

Evaluation Closes The Loop

Microsoft Certified: Azure AI Fundamentals

1Chapter 1: AI-901 Format and Responsible AI

2Chapter 2: Microsoft Foundry, Models, and Agents

3Chapter 3: Azure AI Services, Vision, Language, and Extraction

4Chapter 4: AI-901 Scenario and Service Selection

5Chapter 5: Practice Labs, Common Traps, and Final Review

Microsoft Certified: Azure AI Fundamentals (AI-901)

2.2 Prompts, RAG, and Evaluation

Key Takeaways

Prompting Is Application Design

Prompt Choices That Matter

RAG: Retrieve, Augment, Generate

RAG Limits And Security

Evaluation Closes The Loop