6.3 Embeddings and Vector Search for RAG
Key Takeaways
- Embeddings convert text into dense numerical vectors that capture semantic meaning — similar texts produce similar vectors.
- Azure OpenAI embedding models: text-embedding-3-large (3,072 dimensions), text-embedding-3-small (1,536 dimensions), and text-embedding-ada-002 (1,536 dimensions, legacy).
- The RAG pattern: embed a user query → search for similar vectors in Azure AI Search → pass retrieved documents as context to GPT → generate a grounded response.
- Chunking strategies split large documents into smaller segments before embedding — chunk size affects retrieval precision and context relevance.
- Hybrid search combines keyword (BM25) and vector search, with optional semantic ranking, for the most comprehensive retrieval results.
Embeddings and Vector Search for RAG
Quick Answer: Embeddings convert text to vectors that capture meaning. The RAG pattern: embed query → vector search in AI Search → pass results as context to GPT → generate grounded response. Use hybrid search (keyword + vector + semantic ranking) for best results. Chunk documents before embedding.
Generating Embeddings
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-06-01",
    azure_endpoint="https://my-openai.openai.azure.com/"
)

# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-large-deployment",
    input="What is machine learning?"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 3072 for text-embedding-3-large
print(f"First 5 values: {embedding[:5]}")
```
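"Similar texts produce similar vectors" is usually quantified with cosine similarity. A minimal sketch in pure Python — the three-dimensional vectors below are toy illustrations, not real embeddings, which have 1,536 or 3,072 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only
ml = [0.9, 0.1, 0.2]        # "What is machine learning?"
ai = [0.85, 0.15, 0.25]     # "Explain artificial intelligence"
cooking = [0.1, 0.9, 0.3]   # "How do I bake bread?"

print(cosine_similarity(ml, ai))       # high: related topics
print(cosine_similarity(ml, cooking))  # lower: unrelated topics
```

A vector search engine performs this comparison (or an approximate-nearest-neighbor version of it) between the query vector and every indexed vector.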
Embedding Models Comparison
| Model | Dimensions | Max Tokens | Cost | Use Case |
|---|---|---|---|---|
| text-embedding-3-large | 3,072 | 8,191 | Higher | Best quality, large-scale RAG |
| text-embedding-3-small | 1,536 | 8,191 | Lower | Cost-effective, good quality |
| text-embedding-ada-002 | 1,536 | 8,191 | Medium | Legacy, still widely used |
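Dimension count also drives index size: Azure AI Search stores vectors as 32-bit floats (`Collection(Edm.Single)`), so each vector costs roughly dimensions × 4 bytes. A back-of-the-envelope sketch, assuming uncompressed float32 storage (quantization options can shrink this):

```python
def index_size_mb(num_chunks, dimensions, bytes_per_value=4):
    """Approximate raw vector storage: one float32 per dimension per chunk."""
    return num_chunks * dimensions * bytes_per_value / (1024 ** 2)

# 100,000 chunks at each model's dimensionality
print(round(index_size_mb(100_000, 3072), 1))  # text-embedding-3-large
print(round(index_size_mb(100_000, 1536), 1))  # text-embedding-3-small / ada-002
```

Halving the dimensions halves the vector storage, which is one reason text-embedding-3-small is attractive for cost-sensitive workloads.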
Chunking Strategies
Large documents must be split into chunks before embedding:
Strategies
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple, predictable | May cut mid-sentence |
| Sentence-based | Split on sentence boundaries | Preserves meaning | Uneven chunk sizes |
| Paragraph-based | Split on paragraph breaks | Natural boundaries | Very uneven sizes |
| Semantic | Use NLP to split on topic shifts | Best relevance | Most complex |
| Sliding window | Overlapping chunks | Preserves context | Redundant storage |
Chunk Size Recommendations
| Chunk Size | When to Use |
|---|---|
| 256-512 tokens | Fine-grained retrieval, FAQ-style Q&A |
| 512-1024 tokens | General-purpose RAG (most common) |
| 1024-2048 tokens | Complex topics requiring more context |
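The recommendations above are in tokens, while simple chunkers often count characters or words. A rough rule of thumb for English text is about 4 characters per token; the sketch below uses that heuristic (for exact counts, use a tokenizer library such as tiktoken with the model's encoding):

```python
def estimate_tokens(text):
    """Rough token estimate for English text (~4 characters per token)."""
    return max(1, len(text) // 4)

paragraph = "Azure AI Search supports keyword, vector, and hybrid retrieval."
print(estimate_tokens(paragraph))
```

This is only a sizing heuristic — actual token counts vary with vocabulary and language.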
Implementing Chunking
```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start = end - overlap  # Overlap with the previous chunk
    return chunks

# Chunk a document
chunks = chunk_text(document_text, chunk_size=500, overlap=100)

# Embed each chunk and index it in Azure AI Search
# (assumes a SearchClient named search_client is already configured)
for i, chunk in enumerate(chunks):
    embedding = client.embeddings.create(
        model="text-embedding-3-large-deployment",
        input=chunk
    ).data[0].embedding

    search_client.upload_documents([{
        "id": f"doc1-chunk{i}",
        "content": chunk,
        "contentVector": embedding
    }])
```
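The overlap behavior is easy to sanity-check without any Azure services. Using a synthetic 12-word document and toy sizes (restating the same word-based chunker so the snippet is self-contained), each chunk begins `overlap` words before the previous one ends:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step forward, keeping an overlap
    return chunks

# 12 numbered "words", chunks of 5 with an overlap of 2
doc = " ".join(str(i) for i in range(12))
for c in chunk_text(doc, chunk_size=5, overlap=2):
    print(c)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some redundant storage.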
End-to-End RAG Implementation
```python
from azure.search.documents.models import VectorizedQuery

def rag_query(user_question, search_client, openai_client):
    """Complete RAG pipeline: embed -> search -> generate."""
    # Step 1: Embed the user question
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-large-deployment",
        input=user_question
    ).data[0].embedding

    # Step 2: Hybrid search (keyword + vector + semantic ranking)
    results = search_client.search(
        search_text=user_question,  # Keyword (BM25) search
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=5,
                fields="contentVector"
            )
        ],
        query_type="semantic",
        semantic_configuration_name="my-semantic-config",
        top=5
    )

    # Step 3: Build context from the search results
    context = "\n\n".join(
        f"Source: {r['title']}\n{r['content']}"
        for r in results
    )

    # Step 4: Generate a grounded response
    response = openai_client.chat.completions.create(
        model="gpt4o-deployment",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions
based ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information to answer that question."
Always cite the source document when possible."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ],
        temperature=0.3  # Low temperature for factual responses
    )
    return response.choices[0].message.content
```
Hybrid Search Configuration
Hybrid search combines multiple search methods:
```
[User Query]
  ├── [Keyword Search (BM25)]   → results ranked by term frequency
  └── [Vector Search (cosine)]  → results ranked by semantic similarity
                 │
  [Reciprocal Rank Fusion]      → merge and re-rank the combined results
                 │
  [Semantic Ranking]            → final re-ranking by a language model
                 │
  [Top Results]                 → returned to the application
```
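Reciprocal Rank Fusion merges ranked lists by scoring each document as the sum of 1/(k + rank) over every list it appears in, so documents ranked well by both keyword and vector search rise to the top. A minimal sketch — k = 60 is the constant commonly used in the RRF literature; Azure AI Search's internal parameters may differ:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc3", "doc1", "doc7"]  # hypothetical BM25 ranking
vector_results = ["doc1", "doc5", "doc3"]   # hypothetical vector ranking
print(reciprocal_rank_fusion([keyword_results, vector_results]))
```

Here doc1 wins because it ranks highly in both lists, even though neither list placed it first and last.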
On the Exam: Hybrid search (keyword + vector) with semantic ranking is the recommended approach for RAG applications. Vector-only search may miss keyword matches; keyword-only search misses semantic matches. Hybrid with semantic ranking provides the best of both worlds.
On Your Data Feature
Azure OpenAI "On Your Data" provides a managed RAG experience:
- Connect your Azure AI Search index to Azure OpenAI
- Queries automatically retrieve relevant documents from the index
- Retrieved documents are injected into the prompt as context
- GPT generates a grounded response with citations
- No custom code required for basic RAG scenarios
```python
# Using Azure OpenAI "On Your Data"
response = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "user", "content": "What is our company's return policy?"}
    ],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://my-search.search.windows.net",
                "index_name": "company-docs",
                "authentication": {
                    "type": "api_key",
                    "key": "<search-key>"
                }
            }
        }]
    }
)
```
Review Questions
- What is the primary purpose of text embeddings in RAG?
- Which search approach provides the best results for RAG applications?
- Why is document chunking important before generating embeddings?
- In the RAG system prompt, why should you instruct the model to say "I don't have enough information" when the context doesn't contain an answer?