2.2 Prompts, RAG, and Evaluation
Key Takeaways
- A prompt is the model input package, including user input, system instructions, retrieved context, examples, and formatting constraints where applicable.
- System messages set high-level behavior, boundaries, tone, and output expectations, while user prompts provide the current task.
- Retrieval-augmented generation grounds answers by retrieving relevant content, adding it to the prompt, and asking the model to generate from that context.
- Indexes improve RAG by making private or changing content searchable with keyword, semantic, vector, or hybrid retrieval.
- Foundry evaluations test model, agent, or dataset outputs for quality and safety, including RAG metrics such as retrieval quality, groundedness, relevance, and completeness.
Prompting Is Application Design
For AI-901, a prompt is not just the sentence a user types. It is the package of input sent to the model: system instructions, user input, retrieved context, examples, tool definitions, and output constraints. Good prompting makes the model's job explicit and makes the response easier to check.
A system message gives high-level direction. It can define the assistant's role, boundaries, style, refusal rules, citation expectations, and output format. A user prompt supplies the immediate task. Few-shot examples show the pattern the model should follow, while constraints can require JSON, a short answer, citations, or a specific tone.
Prompt Choices That Matter
| Choice | Exam meaning | Better answer in a scenario |
|---|---|---|
| System message | Sets durable role and behavior for a chat experience | Use it for boundaries, policy, tone, and output format |
| Few-shot examples | Demonstrate an input-output pattern | Use when the output structure is hard to describe in one instruction |
| Temperature | Controls variation in output | Lower it for factual or repeatable answers; raise cautiously for creative work |
| Max tokens | Caps generated length | Use to control cost and avoid overly long responses |
| Grounding context | Adds source content to the prompt | Use when answers must reflect private or current data |
The exam often rewards concrete prompts over vague prompts. A prompt that says "summarize this refund policy in three bullets and cite the policy section used" is stronger than "be helpful." Clear prompting does not guarantee correctness, but it reduces ambiguity and gives evaluation something specific to measure.
RAG: Retrieve, Augment, Generate
Retrieval-augmented generation (RAG) is the standard pattern when an app needs answers based on private, specialized, or frequently changing content. It does not change the model's weights. Instead, the application retrieves relevant content and inserts that content into the model input.
Use this flow:
- Prepare the content. Organize documents, split them into useful chunks, and keep metadata such as title, URL, file name, or security labels.
- Create or connect an index. Azure AI Search is a common index store. Retrieval can use keyword, semantic, vector, or hybrid search.
- Retrieve for the user question. The app finds passages that are likely to answer the current request.
- Augment the prompt. The app combines user input, system rules, and retrieved passages.
- Generate and cite. The model produces an answer that should stay within the provided grounding data and cite the sources when required.
This is different from fine-tuning. RAG is for fresh or private knowledge. Fine-tuning is for adapting behavior, style, or repeated task patterns using training examples. If the question says company policy changes every month, RAG is usually better. If the question says the model must learn a stable response style from many examples, fine-tuning may be relevant.
RAG Limits And Security
RAG improves factual grounding, but it is not magic. Poor chunking, weak embeddings, bad search settings, or irrelevant retrieved passages can still produce unsupported answers. Retrieved documents can also contain sensitive information or malicious instructions. A responsible RAG app applies access control at retrieval time, treats retrieved text as untrusted input, and uses system instructions that tell the model how to handle conflicting or suspicious document content.
RAG also consumes tokens. Retrieved passages increase input length, cost, and latency. A good app ranks and filters passages instead of dumping too much context into the prompt.
Evaluation Closes The Loop
Foundry evaluations test generative AI models and agents against datasets, conversations, traces, or synthetic scenarios. For a model, evaluation can measure output quality, safety, and task fit. For an agent, it can evaluate full conversations or individual turns.
RAG has two evaluation layers:
- Process evaluation checks retrieval. Are the returned chunks relevant? Did the index return the right documents? Should chunk size, top-k, semantic ranking, or vector settings change?
- System evaluation checks the final answer. Is it grounded in the provided context? Does it answer the query? Is it complete against a ground truth answer?
AI-901 does not require you to calculate metrics by hand. It expects you to know why evaluation is necessary. A fluent answer can still be false, incomplete, unsafe, or unsupported. The better production habit is to test prompts, RAG retrieval, groundedness, relevance, and safety before deploying and then monitor real use for regressions.
A benefits chatbot must answer from the company's current HR handbook and include a source reference for each policy answer. The handbook changes several times a year. Which design choice is most appropriate?
A RAG prototype gives poor answers, and logs show the final prompt contains passages from unrelated documents. Which investigation should come first?