6.3 Embeddings and Vector Search for RAG

Key Takeaways

  • Embeddings convert text into dense numerical vectors that capture semantic meaning — similar texts produce similar vectors.
  • Azure OpenAI embedding models: text-embedding-3-large (3,072 dimensions), text-embedding-3-small (1,536 dimensions), and text-embedding-ada-002 (1,536 dimensions, legacy).
  • The RAG pattern: embed a user query → search for similar vectors in Azure AI Search → pass retrieved documents as context to GPT → generate a grounded response.
  • Chunking strategies split large documents into smaller segments before embedding — chunk size affects retrieval precision and context relevance.
  • Hybrid search combines keyword (BM25) and vector search, with optional semantic ranking, for the most comprehensive retrieval results.
Last updated: March 2026

Quick Answer: Embeddings convert text to vectors that capture meaning. The RAG pattern: embed query → vector search in AI Search → pass results as context to GPT → generate grounded response. Use hybrid search (keyword + vector + semantic ranking) for best results. Chunk documents before embedding.

Generating Embeddings

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-06-01",
    azure_endpoint="https://my-openai.openai.azure.com/"
)

# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-large-deployment",
    input="What is machine learning?"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 3072 for text-embedding-3-large
print(f"First 5 values: {embedding[:5]}")
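Similarity between embeddings is typically measured with cosine similarity. A minimal sketch in pure Python — the three-element vectors here are toy stand-ins for real 3,072-dimension embeddings, chosen only to illustrate the comparison:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.85, 0.2, 0.05]
v_finance = [0.0, 0.1, 0.95]

print(cosine_similarity(v_cat, v_kitten))   # High: related concepts
print(cosine_similarity(v_cat, v_finance))  # Low: unrelated concepts
```

In practice you rarely compute this yourself — Azure AI Search scores vector queries for you — but it is the measure the index applies under the hood.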

Embedding Models Comparison

| Model | Dimensions | Max Tokens | Cost | Use Case |
|-------|------------|------------|------|----------|
| text-embedding-3-large | 3,072 | 8,191 | Higher | Best quality, large-scale RAG |
| text-embedding-3-small | 1,536 | 8,191 | Lower | Cost-effective, good quality |
| text-embedding-ada-002 | 1,536 | 8,191 | Medium | Legacy, still widely used |

Chunking Strategies

Large documents must be split into chunks before embedding — embedding models cap input at 8,191 tokens, and smaller chunks make retrieval more precise:

Strategies

| Strategy | Description | Pros | Cons |
|----------|-------------|------|------|
| Fixed-size | Split every N characters/tokens | Simple, predictable | May cut mid-sentence |
| Sentence-based | Split on sentence boundaries | Preserves meaning | Uneven chunk sizes |
| Paragraph-based | Split on paragraph breaks | Natural boundaries | Very uneven sizes |
| Semantic | Use NLP to split on topic shifts | Best relevance | Most complex |
| Sliding window | Overlapping chunks | Preserves context | Redundant storage |
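As one illustration, a sentence-based chunker can be sketched in a few lines — splitting on sentence-ending punctuation (a simplification; production code would use an NLP library) and packing whole sentences into chunks up to a word budget:

```python
import re

def chunk_by_sentences(text, max_words=100):
    """Group whole sentences into chunks of at most max_words words."""
    # Naive sentence split: ., !, or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_by_sentences("First sentence. Second one! A third? Done.", max_words=5)
# Each chunk ends on a sentence boundary, never mid-sentence
```

Note the trade-off from the table above: no sentence is ever cut, but chunk sizes vary with sentence length.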

Chunk Size Recommendations

| Chunk Size | When to Use |
|------------|-------------|
| 256-512 tokens | Fine-grained retrieval, FAQ-style Q&A |
| 512-1024 tokens | General-purpose RAG (most common) |
| 1024-2048 tokens | Complex topics requiring more context |
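The budgets above are in tokens, while simple splitters count words. A rough rule of thumb for English text is about 4 characters or 0.75 words per token — an approximation only; use a real tokenizer (e.g. tiktoken) when exact counts matter:

```python
def estimate_tokens(text):
    """Rough token estimate: English averages ~4 characters per token."""
    return max(1, len(text) // 4)

def words_for_token_budget(token_budget):
    """Approximate word count that fits a token budget (~0.75 words/token)."""
    return int(token_budget * 0.75)

print(words_for_token_budget(512))  # 384 words for a 512-token chunk
```

So a general-purpose 512-1024 token chunk corresponds to roughly 380-770 words.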

Implementing Chunking

def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks (sizes measured in words)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # Last chunk reached; don't emit a duplicate tail
        start = end - overlap  # Overlap with previous chunk

    return chunks

# Chunk a document
chunks = chunk_text(document_text, chunk_size=500, overlap=100)

# Embed each chunk
for i, chunk in enumerate(chunks):
    embedding = client.embeddings.create(
        model="text-embedding-3-large-deployment",
        input=chunk
    ).data[0].embedding

    # Index chunk + embedding in Azure AI Search
    search_client.upload_documents([{
        "id": f"doc1-chunk{i}",
        "content": chunk,
        "contentVector": embedding
    }])

End-to-End RAG Implementation

from azure.search.documents.models import VectorizedQuery

def rag_query(user_question, search_client, openai_client):
    """Complete RAG pipeline: embed → search → generate."""

    # Step 1: Embed the user question
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-large-deployment",
        input=user_question
    ).data[0].embedding

    # Step 2: Hybrid search (keyword + vector + semantic)
    results = search_client.search(
        search_text=user_question,  # Keyword search
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=5,
                fields="contentVector"
            )
        ],
        query_type="semantic",
        semantic_configuration_name="my-semantic-config",
        top=5
    )

    # Step 3: Build context from search results
    # (assumes the index defines "title" and "content" fields)
    context = "\n\n".join([
        f"Source: {r['title']}\n{r['content']}"
        for r in results
    ])

    # Step 4: Generate grounded response
    response = openai_client.chat.completions.create(
        model="gpt4o-deployment",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions
based ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information to answer that question."
Always cite the source document when possible."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ],
        temperature=0.3  # Low temperature for factual responses
    )

    return response.choices[0].message.content

Hybrid Search Configuration

Hybrid search combines multiple search methods:

[User Query]
    ├── [Keyword Search (BM25)]     → Results ranked by term frequency
    ├── [Vector Search (Cosine)]    → Results ranked by semantic similarity
    └── [Reciprocal Rank Fusion]    → Merge and re-rank combined results
        └── [Semantic Ranking]      → Final re-ranking by language model
            └── [Top Results]       → Returned to the application
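The Reciprocal Rank Fusion step above can be sketched as follows — each document's fused score is the sum of 1/(k + rank) over the rankings it appears in, with k = 60 as the conventional constant (the document IDs here are made up for illustration):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc3", "doc1", "doc7"]  # BM25 ranking
vector_results = ["doc1", "doc5", "doc3"]   # Cosine-similarity ranking
fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # doc1 and doc3 rise: they appear in both rankings
```

Documents found by both search methods outrank documents found by only one — which is exactly why hybrid search beats either method alone. Azure AI Search performs this fusion for you; you never call it directly.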

On the Exam: Hybrid search (keyword + vector) with semantic ranking is the recommended approach for RAG applications. Vector-only search may miss keyword matches; keyword-only search misses semantic matches. Hybrid with semantic ranking provides the best of both worlds.

On Your Data Feature

Azure OpenAI "On Your Data" provides a managed RAG experience:

  1. Connect your Azure AI Search index to Azure OpenAI
  2. Queries automatically retrieve relevant documents from the index
  3. Retrieved documents are injected into the prompt as context
  4. GPT generates a grounded response with citations
  5. No custom code required for basic RAG scenarios

# Using Azure OpenAI "On Your Data"
response = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "user", "content": "What is our company's return policy?"}
    ],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://my-search.search.windows.net",
                "index_name": "company-docs",
                "authentication": {
                    "type": "api_key",
                    "key": "<search-key>"
                }
            }
        }]
    }
)
Test Your Knowledge

  1. What is the primary purpose of text embeddings in RAG?
  2. Which search approach provides the best results for RAG applications?
  3. Why is document chunking important before generating embeddings?
  4. In the RAG system prompt, why should you instruct the model to say "I don't have enough information" when the context doesn't contain an answer?