5.4 Retrieval-Augmented Generation (RAG)
Key Takeaways
- RAG (Retrieval-Augmented Generation) enhances generative AI by retrieving relevant information from external data sources and including it in the prompt as grounding data.
- RAG solves the knowledge cutoff problem — the model can answer questions about your proprietary data and recent events that were not in its training data.
- The RAG pattern follows three steps: (1) retrieve relevant documents from a knowledge base, (2) include them in the prompt as context, (3) generate a grounded response.
- Azure AI Search is the primary Azure service for implementing RAG — it provides vector search, semantic ranking, and hybrid search for finding relevant documents.
- Embeddings convert text into numerical vectors that capture meaning, enabling semantic search (finding content by meaning, not just keywords).
Retrieval-Augmented Generation (RAG)
Quick Answer: RAG enhances generative AI by retrieving relevant documents from external sources and including them as context in the prompt. This grounds the model's response in factual data, reducing hallucinations and enabling answers about proprietary or recent information. Azure AI Search is the primary Azure service for implementing RAG.
What Is RAG?
Retrieval-Augmented Generation (RAG) is an architecture pattern that combines information retrieval with generative AI to produce more accurate, grounded, and up-to-date responses.
The Problem RAG Solves
Generative AI models have two fundamental limitations:
- Knowledge cutoff — they do not know about events after their training data was collected
- No proprietary data — they have no access to your organization's private documents, databases, or systems
RAG solves both problems by retrieving relevant information and injecting it into the prompt before the model generates a response.
RAG vs. Standard Generation
| Approach | How It Works | Accuracy | Data Access |
|---|---|---|---|
| Standard generation | Model responds from pre-trained knowledge only | Risk of hallucinations and outdated info | Only pre-training data |
| RAG | Retrieve relevant documents → include in prompt → generate grounded response | Higher accuracy, fewer hallucinations | Your data + pre-training |
How RAG Works
The Three-Step RAG Process
Step 1: Retrieve
- User asks a question
- The system searches a knowledge base (for example, an Azure AI Search index) for relevant documents
- Retrieval can use keyword search, vector search, or hybrid search
- Top-K most relevant documents or passages are selected
Step 2: Augment
- Retrieved documents are added to the prompt as context
- The prompt now includes: system message + retrieved context + user question
- This "augments" the model's knowledge with specific, relevant data
Step 3: Generate
- The generative AI model produces a response grounded in the retrieved context
- The response is based on the provided documents, not just pre-trained knowledge
- This significantly reduces hallucinations
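The three steps above can be sketched as a single function. Here `search_fn` and `generate_fn` are hypothetical stand-ins for a retrieval call (such as a query against Azure AI Search) and a chat-model call; the names and prompt wording are illustrative, not real SDK calls.

```python
def answer_with_rag(question, search_fn, generate_fn, top_k=3):
    # Step 1: Retrieve — find the top-K most relevant passages
    # for the user's question.
    passages = search_fn(question)[:top_k]

    # Step 2: Augment — build a prompt that injects the retrieved
    # passages as grounding context alongside the question.
    context = "\n\n".join(passages)
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Step 3: Generate — the model produces a response grounded
    # in the retrieved context, not just its pre-trained knowledge.
    return generate_fn(prompt)
```

Because the retrieval step runs at query time, updating the knowledge base immediately changes the answers, with no retraining of the model.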
RAG Example
Without RAG:
User: "What is our company's parental leave policy?"
Model: "I'm sorry, I don't have information about your specific company's policies." (Or worse, it hallucinates a plausible-sounding policy.)
With RAG:
- Retrieve: Search company HR documents → find parental_leave_policy.pdf
- Augment: Add policy document to prompt as context
- Generate: Model responds: "According to your company's policy, employees are entitled to 16 weeks of paid parental leave, applicable to both birth and adoptive parents..."
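For concreteness, here is roughly what the augmented prompt from the "With RAG" flow might look like. The policy excerpt and wording are invented for illustration, not taken from a real document.

```python
# Hypothetical passage retrieved from parental_leave_policy.pdf
retrieved = (
    "Parental Leave Policy: Employees are entitled to 16 weeks of paid "
    "parental leave, applicable to both birth and adoptive parents."
)
question = "What is our company's parental leave policy?"

# System message + retrieved context + user question
prompt = (
    "You are an HR assistant. Answer using only the context provided.\n\n"
    f"Context:\n{retrieved}\n\n"
    f"Question: {question}"
)
print(prompt)
```

The model never needs the policy in its training data; the answer comes from the document placed in the prompt.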
Embeddings and Vector Search
What Are Embeddings?
Embeddings are numerical vector representations of text that capture its semantic meaning. Similar meanings produce similar vectors, enabling semantic search.
| Text | Vector (simplified) | Similar To |
|---|---|---|
| "How to reset my password" | [0.12, 0.87, 0.34, ...] | "Change login credentials" |
| "What's the weather like" | [0.91, 0.05, 0.73, ...] | "Temperature forecast today" |
Why Embeddings Matter for RAG
Traditional keyword search matches exact words: searching "reset password" only finds documents containing those exact words.
Semantic search with embeddings matches MEANING: searching "reset password" also finds documents about "change credentials," "update login," or "recover account access" — even if they don't contain the exact words "reset" or "password."
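A minimal sketch of why vector similarity captures meaning, using cosine similarity over the toy vectors from the table above. Real embeddings have hundreds or thousands of dimensions; these 3-D values are illustrative only.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # close to 1.0 = similar meaning, close to 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

password_q  = [0.12, 0.87, 0.34]  # "How to reset my password"
credentials = [0.15, 0.82, 0.30]  # "Change login credentials" (toy vector)
weather     = [0.91, 0.05, 0.73]  # "What's the weather like"

# The semantically related pair scores far higher than the unrelated one,
# even though the two phrases share no keywords.
assert cosine_similarity(password_q, credentials) > cosine_similarity(password_q, weather)
```

Vector search in a RAG system boils down to this comparison at scale: embed the query, then return the stored passages whose vectors score highest against it.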
Types of Search in Azure AI Search
| Search Type | How It Works | Best For |
|---|---|---|
| Keyword search | Matches exact terms (TF-IDF, BM25) | Finding specific terms and phrases |
| Vector search | Matches meaning using embeddings | Finding semantically similar content |
| Hybrid search | Combines keyword + vector search | Best overall retrieval quality |
| Semantic ranking | Re-ranks results using a language model | Improving result relevance |
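One common way to merge keyword and vector rankings into a single hybrid result list is Reciprocal Rank Fusion (RRF), which Azure AI Search uses for hybrid search. The sketch below is a generic illustration of the technique, not the service's exact implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Merge several ranked result lists (e.g., keyword and vector results)
    # into one. Each document earns 1 / (k + rank) per list it appears in,
    # so documents ranked highly by BOTH searches rise to the top.
    # k = 60 is the conventional smoothing constant.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_a", "doc_b", "doc_c"]
vector_results  = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([keyword_results, vector_results]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it ranks well in both lists, which is exactly the behavior that makes hybrid search more robust than either search type alone.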
Azure AI Search for RAG
Azure AI Search is the primary service for building RAG solutions on Azure:
Key Capabilities for RAG
| Feature | Description |
|---|---|
| Indexing | Ingest and index documents from Azure Blob, SQL, Cosmos DB, and more |
| Vector search | Store and search embedding vectors for semantic matching |
| Hybrid search | Combine keyword and vector search for best results |
| Semantic ranking | Use AI to re-rank results by relevance |
| AI enrichment | Apply AI skills during indexing (OCR, entity extraction, translation) |
| Integrated vectorization | Automatically generate embeddings during indexing and querying |
RAG Architecture on Azure
```
User Question
      │
      ▼
[Azure AI Search] ──── retrieve relevant documents
      │
      ▼
[Prompt Construction] ──── system message + retrieved docs + question
      │
      ▼
[Azure OpenAI Service] ──── generate grounded response
      │
      ▼
Grounded Answer
```
On the Exam: Know that RAG uses a retrieval step (Azure AI Search) to find relevant documents, then includes them in the prompt for the generative model (Azure OpenAI). This grounds responses in factual data and reduces hallucinations. You do NOT need to know how to implement RAG — just understand the concept and why it is important.
RAG vs. Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| How it works | Retrieve relevant data at query time | Modify the model's weights with training data |
| Data freshness | Always uses latest data from knowledge base | Reflects data at time of fine-tuning |
| Cost | Search infrastructure costs | Training compute costs |
| Speed to implement | Fast (index documents, configure search) | Slow (prepare data, train, validate) |
| Best for | Dynamic data, Q&A over documents, current info | Consistent behavior change, style adaptation |
| Hallucination reduction | Strong (grounded in retrieved data) | Moderate (model still generates from weights) |
On the Exam: If a question asks about answering questions from company documents or providing up-to-date information, RAG is usually the correct answer. Fine-tuning is for changing the model's overall behavior or style, not for adding specific knowledge.
Review Questions
What is the primary purpose of RAG (Retrieval-Augmented Generation)?
Which Azure service is primarily used as the retrieval component in a RAG architecture?
What are "embeddings" in the context of generative AI?
A company wants employees to ask natural language questions about internal policies and receive accurate answers from their HR documents. Which approach is most appropriate?
Put the RAG process steps in the correct order: