5.4 Retrieval-Augmented Generation (RAG)

Key Takeaways

  • RAG (Retrieval-Augmented Generation) enhances generative AI by retrieving relevant information from external data sources and including it in the prompt as grounding data.
  • RAG solves the knowledge cutoff problem — the model can answer questions about your proprietary data and recent events that were not in its training data.
  • The RAG pattern follows three steps: (1) retrieve relevant documents from a knowledge base, (2) include them in the prompt as context, (3) generate a grounded response.
  • Azure AI Search is the primary Azure service for implementing RAG — it provides vector search, semantic ranking, and hybrid search for finding relevant documents.
  • Embeddings convert text into numerical vectors that capture meaning, enabling semantic search (finding content by meaning, not just keywords).
Last updated: March 2026

Retrieval-Augmented Generation (RAG)

Quick Answer: RAG enhances generative AI by retrieving relevant documents from external sources and including them as context in the prompt. This grounds the model's response in factual data, reducing hallucinations and enabling answers about proprietary or recent information. Azure AI Search is the primary Azure service for implementing RAG.

What Is RAG?

Retrieval-Augmented Generation (RAG) is an architecture pattern that combines information retrieval with generative AI to produce more accurate, grounded, and up-to-date responses.

The Problem RAG Solves

Generative AI models have two fundamental limitations:

  1. Knowledge cutoff — they do not know about events after their training data was collected
  2. No proprietary data — they have no access to your organization's private documents, databases, or systems

RAG solves both problems by retrieving relevant information and injecting it into the prompt before the model generates a response.

RAG vs. Standard Generation

| Approach | How It Works | Accuracy | Data Access |
|---|---|---|---|
| Standard generation | Model responds from pre-trained knowledge only | Risk of hallucinations and outdated info | Only pre-training data |
| RAG | Retrieve relevant documents → include in prompt → generate grounded response | Higher accuracy, fewer hallucinations | Your data + pre-training |

How RAG Works

The Three-Step RAG Process

Step 1: Retrieve

  • User asks a question
  • The system searches a knowledge base (typically Azure AI Search) for relevant documents
  • Retrieval can use keyword search, vector search, or hybrid search
  • Top-K most relevant documents or passages are selected

Step 2: Augment

  • Retrieved documents are added to the prompt as context
  • The prompt now includes: system message + retrieved context + user question
  • This "augments" the model's knowledge with specific, relevant data

Step 3: Generate

  • The generative AI model produces a response grounded in the retrieved context
  • The response is based on the provided documents, not just pre-trained knowledge
  • This significantly reduces hallucinations
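The three steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the knowledge base is an in-memory dictionary, retrieval is naive keyword overlap, and the generation step is a stub standing in for a call to a generative model such as Azure OpenAI.

```python
# Toy sketch of the retrieve → augment → generate RAG flow.
# All document content and function names here are illustrative.

KNOWLEDGE_BASE = {
    "parental_leave_policy": "Employees receive 16 weeks of paid parental leave.",
    "travel_policy": "Economy class is required for flights under 6 hours.",
}

def retrieve(question: str, top_k: int = 1) -> list[str]:
    """Step 1: score documents by naive keyword overlap and keep the top-k."""
    q_terms = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.values(),
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def augment(question: str, docs: list[str]) -> str:
    """Step 2: build a prompt of system message + retrieved context + question."""
    context = "\n".join(docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def generate(prompt: str) -> str:
    """Step 3: placeholder for a call to a generative model (e.g., Azure OpenAI)."""
    return f"[model response grounded in a prompt of {len(prompt)} characters]"

question = "How many weeks of parental leave do employees get?"
docs = retrieve(question)
answer = generate(augment(question, docs))
```

In a production system, `retrieve` would call Azure AI Search and `generate` would call a deployed Azure OpenAI model; the shape of the flow stays the same.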

RAG Example

Without RAG:

User: "What is our company's parental leave policy?"
Model: "I'm sorry, I don't have information about your specific company's policies." (Or worse, it hallucinates a policy.)

With RAG:

  1. Retrieve: Search company HR documents → find parental_leave_policy.pdf
  2. Augment: Add policy document to prompt as context
  3. Generate: Model responds: "According to your company's policy, employees are entitled to 16 weeks of paid parental leave, applicable to both birth and adoptive parents..."

Embeddings and Vector Search

What Are Embeddings?

Embeddings are numerical vector representations of text that capture its semantic meaning. Similar meanings produce similar vectors, enabling semantic search.

| Text | Vector (simplified) | Similar To |
|---|---|---|
| "How to reset my password" | [0.12, 0.87, 0.34, ...] | "Change login credentials" |
| "What's the weather like" | [0.91, 0.05, 0.73, ...] | "Temperature forecast today" |

Why Embeddings Matter for RAG

Traditional keyword search matches exact words: searching "reset password" only finds documents containing those exact words.

Semantic search with embeddings matches MEANING: searching "reset password" also finds documents about "change credentials," "update login," or "recover account access" — even if they don't contain the exact words "reset" or "password."
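Semantic similarity between embedding vectors is typically measured with cosine similarity. The short sketch below uses made-up three-dimensional vectors (real embeddings from a text-embedding model have hundreds or thousands of dimensions); the cosine math itself is standard.

```python
# Cosine similarity between toy "embedding" vectors.
# The vectors are hypothetical; real ones come from an embedding model.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.12, 0.87, 0.34]          # "reset password" (hypothetical embedding)
password_doc = [0.10, 0.90, 0.30]   # "change login credentials"
weather_doc = [0.91, 0.05, 0.73]    # "temperature forecast today"

# The semantically related document scores far higher than the unrelated one,
# even though the texts share no keywords.
related = cosine_similarity(query, password_doc)
unrelated = cosine_similarity(query, weather_doc)
```

This is why a query about "reset password" can surface a "change login credentials" document: their vectors point in nearly the same direction.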

Types of Search in Azure AI Search

| Search Type | How It Works | Best For |
|---|---|---|
| Keyword search | Matches exact terms (TF-IDF, BM25) | Finding specific terms and phrases |
| Vector search | Matches meaning using embeddings | Finding semantically similar content |
| Hybrid search | Combines keyword + vector search | Best overall retrieval quality |
| Semantic ranking | Re-ranks results using a language model | Improving result relevance |
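Hybrid search has to merge two differently scored result lists. Azure AI Search does this with Reciprocal Rank Fusion (RRF), which scores each document by where it ranks in each list rather than by its raw scores. Here is a minimal RRF sketch; the document IDs and rankings are made up for illustration.

```python
# Reciprocal Rank Fusion: each document earns 1/(k + rank) from every
# ranked list it appears in; documents ranked well in BOTH lists win.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc3", "doc1", "doc7"]  # hypothetical BM25 ranking
vector_results = ["doc1", "doc5", "doc3"]   # hypothetical vector ranking

fused = rrf([keyword_results, vector_results])
# "doc1" comes first: it ranks highly in both lists.
```

The constant `k` (60 is a common default) dampens the influence of any single top-ranked result, so agreement between the two rankings matters more than either ranking alone.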

Azure AI Search for RAG

Azure AI Search is the primary service for building RAG solutions on Azure.

Key Capabilities for RAG

| Feature | Description |
|---|---|
| Indexing | Ingest and index documents from Azure Blob Storage, SQL, Cosmos DB, and more |
| Vector search | Store and search embedding vectors for semantic matching |
| Hybrid search | Combine keyword and vector search for best results |
| Semantic ranking | Use AI to re-rank results by relevance |
| AI enrichment | Apply AI skills during indexing (OCR, entity extraction, translation) |
| Integrated vectorization | Automatically generate embeddings during indexing and querying |

RAG Architecture on Azure

User Question
     │
     ▼
[Azure AI Search]  ──── retrieve relevant documents
     │
     ▼
[Prompt Construction] ──── system message + retrieved docs + question
     │
     ▼
[Azure OpenAI Service] ──── generate grounded response
     │
     ▼
Grounded Answer
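The "Prompt Construction" box in the diagram above is ordinary string assembly: system message, then retrieved documents, then the user's question. The wording of the system message below is illustrative, not an Azure requirement.

```python
# Sketch of prompt construction for RAG: system message + retrieved
# documents + user question, combined into a single grounded prompt.

def build_rag_prompt(system_message: str, retrieved_docs: list[str], question: str) -> str:
    # Label each retrieved document so the model (and the reader) can
    # trace which source supports which part of the answer.
    sources = "\n\n".join(
        f"[Source {i + 1}]\n{doc}" for i, doc in enumerate(retrieved_docs)
    )
    return f"{system_message}\n\nSources:\n{sources}\n\nQuestion: {question}"

prompt = build_rag_prompt(
    "Answer only from the sources below. If the answer is not there, say so.",
    ["Employees are entitled to 16 weeks of paid parental leave."],
    "What is our parental leave policy?",
)
```

The resulting string is what gets sent to Azure OpenAI; instructing the model to answer only from the supplied sources is what grounds the response and curbs hallucinations.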

On the Exam: Know that RAG uses a retrieval step (Azure AI Search) to find relevant documents, then includes them in the prompt for the generative model (Azure OpenAI). This grounds responses in factual data and reduces hallucinations. You do NOT need to know how to implement RAG — just understand the concept and why it is important.

RAG vs. Fine-Tuning

| Aspect | RAG | Fine-Tuning |
|---|---|---|
| How it works | Retrieve relevant data at query time | Modify the model's weights with training data |
| Data freshness | Always uses latest data from knowledge base | Reflects data at time of fine-tuning |
| Cost | Search infrastructure costs | Training compute costs |
| Speed to implement | Fast (index documents, configure search) | Slow (prepare data, train, validate) |
| Best for | Dynamic data, Q&A over documents, current info | Consistent behavior change, style adaptation |
| Hallucination reduction | Strong (grounded in retrieved data) | Moderate (model still generates from weights) |

On the Exam: If a question asks about answering questions from company documents or providing up-to-date information, RAG is usually the correct answer. Fine-tuning is for changing the model's overall behavior or style, not for adding specific knowledge.

Test Your Knowledge

What is the primary purpose of RAG (Retrieval-Augmented Generation)?

Test Your Knowledge

Which Azure service is primarily used as the retrieval component in a RAG architecture?

Test Your Knowledge

What are "embeddings" in the context of generative AI?

Test Your Knowledge

A company wants employees to ask natural language questions about internal policies and receive accurate answers from their HR documents. Which approach is most appropriate?

Test Your Knowledge (Ordering)

Put the RAG process steps in the correct order:

Arrange the items in the correct order

1
Generate a grounded response using the generative AI model
2
Retrieve relevant documents from the knowledge base
3
User asks a question
4
Include retrieved documents in the prompt as context