6.3 Embeddings and Vector Search for RAG

Key Takeaways

Embeddings map text to dense vectors so that semantically similar text yields vectors that are close (high cosine similarity).
text-embedding-3-large = 3,072 dims, text-embedding-3-small = 1,536 dims, text-embedding-ada-002 = 1,536 dims (legacy); all accept up to ~8,191 tokens.
RAG flow: embed the query, run vector/hybrid search in Azure AI Search, inject top chunks as context, then have GPT generate a grounded, cited answer.
Chunking (256-1024 tokens with overlap) is mandatory both because of the embedding token limit and because smaller chunks give sharper retrieval.
Hybrid search (BM25 keyword + vector) merged with Reciprocal Rank Fusion and re-scored by semantic ranking is Microsoft's recommended retrieval design.

Last updated: June 2026

Quick Answer: Embeddings turn text into vectors whose cosine similarity measures meaning. RAG = embed query → vector/hybrid search in Azure AI Search → feed top chunks to GPT → generate a grounded answer with citations. Chunk documents (256-1024 tokens, with overlap) before embedding, and prefer hybrid search + semantic ranking.

Embedding Models and Dimensions

Model	Dimensions	Max tokens	Notes
text-embedding-3-large	3,072	~8,191	Highest quality; supports `dimensions` shortening
text-embedding-3-small	1,536	~8,191	Cheaper, strong quality
text-embedding-ada-002	1,536	~8,191	Legacy; still common

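# Assumes `client` is an already-configured Azure OpenAI client (openai.AzureOpenAI).
resp = client.embeddings.create(
    model="embed-large",            # deployment name, not the underlying model name
    input="What is machine learning?"
)
vec = resp.data[0].embedding       # list of floats; len == 3072 for 3-large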

On the Exam: The 3-series supports a dimensions parameter to truncate vectors (e.g. 3-large down to 1,024) trading a little accuracy for cheaper storage and faster search. The index vector field dimension must equal the embedding length you store — a mismatch is a classic failure scenario.
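The dimension-mismatch scenario can be caught before anything is uploaded with a simple length check. The following is a minimal sketch, not part of any Azure SDK: `DEFAULT_DIMS` and `check_dimensions` are hypothetical names, and the default sizes are taken from the table above.

```python
# Hypothetical helper: fail fast when an embedding does not match the
# dimension configured on the index vector field.
DEFAULT_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

def check_dimensions(vector, index_field_dims):
    """Return the vector unchanged, or raise ValueError on a length mismatch."""
    if len(vector) != index_field_dims:
        raise ValueError(
            f"embedding has {len(vector)} dims, "
            f"index field expects {index_field_dims}"
        )
    return vector

# A 3-large vector shortened to 1,024 dims needs a 1,024-dim index field.
shortened = [0.0] * 1024
check_dimensions(shortened, 1024)  # passes silently
```

Running the same check with the default 3,072 as the expected size would raise, which is the failure the exam scenario describes.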

Why Chunk Before Embedding

Two hard reasons:

Token ceiling — an embedding call rejects input over ~8,191 tokens, so long docs must be split.
Retrieval precision — a 30-page PDF in one vector returns the whole document for any vaguely related query; small chunks let search surface only the relevant passage.

Strategy	Pro	Con
Fixed-size (N tokens)	Predictable	Cuts mid-sentence
Sentence/paragraph	Preserves meaning	Uneven sizes
Semantic (topic shift)	Best relevance	Most complex
Sliding window (overlap)	Keeps cross-chunk context	Redundant storage

Chunk size	Use
256-512 tokens	FAQ, fine-grained Q&A
512-1024 tokens	General RAG (default)
1024-2048 tokens	Long, context-heavy topics

Add 10-20% overlap so a fact spanning a boundary survives in at least one chunk.
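The sliding-window strategy with overlap can be sketched in a few lines. This is an illustrative version only: it approximates tokens with whitespace-separated words (an assumption; a real pipeline would count tokens with the embedding model's tokenizer), and the name `chunk_text` is chosen to match the helper that the pipeline code later in this section calls.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into windows of `size` tokens that share `overlap` tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# 1,200 synthetic "tokens" split into 500-token windows with 100 shared tokens.
doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, size=500, overlap=100)
```

With these settings the last 100 words of one chunk reappear as the first 100 words of the next, so a fact that straddles a boundary is fully contained in at least one chunk.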

How Similarity Search Works

Once text is embedded, finding relevant content is a nearest-neighbor problem in vector space. Conceptually, the query vector is compared against every stored chunk vector, and the closest ones, by cosine similarity, are returned. Because that exhaustive comparison is expensive across millions of vectors, Azure AI Search instead builds an approximate nearest neighbor (ANN) index using the HNSW (Hierarchical Navigable Small World) algorithm, trading a sliver of recall for dramatic speed. You configure the vector field with its dimension count, the distance metric, and HNSW parameters when you create the index.
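To make the nearest-neighbor idea concrete, here is a minimal sketch of exact (brute-force) search by cosine similarity. It is the exhaustive comparison that an HNSW index approximates, not how Azure AI Search is implemented; the function names and the toy 3-dimensional vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Score every stored chunk against the query and keep the best k ids."""
    scored = [(cosine_similarity(query_vec, v), cid)
              for cid, v in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]

# Toy 3-dimensional vectors standing in for real embeddings.
chunk_vecs = {
    "ml-intro": [0.9, 0.1, 0.0],
    "ml-history": [0.7, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], chunk_vecs, k=2)
```

The cost of this loop grows linearly with the number of stored vectors, which is why an approximate index is used at scale.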

A worked sizing example: indexing a 200-page manual (assuming roughly 350-400 words per page) at roughly 500 words (≈670 tokens) per chunk with 100-token overlap yields on the order of 150-200 chunks, hence 150-200 embeddings to generate and 150-200 vectors of 3,072 floats each to store. Doubling chunk size roughly halves the vector count and storage but coarsens retrieval; halving it sharpens retrieval but roughly doubles storage and embedding cost. This is the central chunking trade-off the exam probes.
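The sizing arithmetic can be reproduced with a short estimator. This is a rough sketch under stated assumptions: the manual is taken to total about 100,000 tokens, vectors are stored as 4-byte floats, and index overhead such as the HNSW graph is ignored. `estimate_index_size` is a hypothetical helper name.

```python
import math

def estimate_index_size(total_tokens, chunk_tokens, overlap_tokens,
                        dims, bytes_per_float=4):
    """Return (chunk_count, raw_vector_bytes) for a sliding-window chunker."""
    stride = chunk_tokens - overlap_tokens  # new tokens covered per chunk
    if total_tokens <= chunk_tokens:
        chunk_count = 1
    else:
        chunk_count = 1 + math.ceil((total_tokens - chunk_tokens) / stride)
    return chunk_count, chunk_count * dims * bytes_per_float

# Assumption: the 200-page manual totals about 100,000 tokens.
chunk_count, raw_bytes = estimate_index_size(100_000, 670, 100, 3072)
# Doubling the chunk size (same overlap) roughly halves the vector count.
doubled_count, _ = estimate_index_size(100_000, 1340, 100, 3072)
```

Under these assumptions the estimate is 176 chunks and about 2.2 MB of raw vector data, dropping to 81 chunks when the chunk size is doubled.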

On the Exam: Remember that the index vector field dimension must equal the embedding length (3,072 for 3-large, 1,536 for 3-small/ada-002), and that the same embedding model must be used for both indexing and querying. A dimension mismatch typically surfaces as an error, whereas mixing models that share a dimension (for example 3-small and ada-002) silently destroys relevance.

Building the Index and RAG Pipeline

from azure.search.documents.models import VectorizedQuery

# Assumes `client` (Azure OpenAI client), `search_client`
# (azure.search.documents.SearchClient), `doc` (the source text) and a
# `chunk_text` helper are already defined.

# 1. Embed and index each chunk
docs = []
for i, chunk in enumerate(chunk_text(doc, size=500, overlap=100)):
    emb = client.embeddings.create(model="embed-large", input=chunk).data[0].embedding
    docs.append({"id": f"doc1-{i}", "content": chunk, "contentVector": emb})
search_client.upload_documents(documents=docs)      # one batched upload; split large corpora into batches

# 2. Query time: embed, hybrid-search, then ground GPT
def rag(question):
    # Same embedding deployment as at index time, so the vectors are comparable
    qvec = client.embeddings.create(model="embed-large", input=question).data[0].embedding
    results = search_client.search(
        search_text=question,                       # BM25 keyword leg
        vector_queries=[VectorizedQuery(vector=qvec, k_nearest_neighbors=5,
                                        fields="contentVector")],  # vector leg
        query_type="semantic",                      # enable the semantic ranker
        semantic_configuration_name="my-sem", top=5)
    # Prefix each chunk with its id so the model can cite it
    context = "\n\n".join(f"[{r['id']}] {r['content']}" for r in results)
    return client.chat.completions.create(
        model="gpt4o-chat", temperature=0.2,
        messages=[
          {"role": "system", "content":
             "Answer ONLY from the context. If it is not there, say you do not "
             "have enough information. Cite the [id] of each source used."},
          {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]).choices[0].message.content

Note the low temperature (0.2) and the explicit grounding instruction: together they reduce the risk of hallucination, although neither eliminates it. The system prompt telling the model to admit ignorance is the single most exam-tested responsible-AI control in RAG.
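Because the system prompt asks the model to cite the [id] of each source, one optional follow-up is to verify that every cited id was actually retrieved. The sketch below is a hypothetical post-check, not something the pipeline above or any Azure service performs; it assumes ids appear in square brackets exactly as they were injected into the context.

```python
import re

def cited_ids(answer):
    """Return the set of ids that appear as [id] markers in the answer."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))

def unsupported_citations(answer, retrieved_ids):
    """Cited ids that do not correspond to any retrieved chunk."""
    return cited_ids(answer) - set(retrieved_ids)

answer = ("Overfitting is reduced by regularization [doc1-3] "
          "and early stopping [doc1-7].")
bad = unsupported_citations(answer, ["doc1-3", "doc1-4"])
```

A non-empty result (here, the id `doc1-7`) suggests the answer cites something that was never in the supplied context and deserves a closer look.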

How Hybrid Search Ranks

[Query]
 ├─ BM25 keyword search   → exact-term hits
 ├─ Vector search (cosine) → semantic neighbors (k-NN, HNSW index)
 └─ Reciprocal Rank Fusion → merge both result sets
     └─ Semantic ranker     → L2 re-rank with a language model
         └─ Top N to the LLM
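The Reciprocal Rank Fusion step in the diagram can be illustrated with a small function. This is a generic sketch of the RRF formula, in which each document scores the sum of 1 / (k + rank) over the result lists it appears in. The constant k = 60 is a commonly used value and is an assumption here, as are the sample ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum of 1 / (k + rank of d in each list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_hits = ["part-123", "intro", "faq"]         # keyword leg, best first
vector_hits = ["intro", "overview", "part-123"]  # vector leg, best first
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that rank well in both lists ("intro", "part-123") rise above those found by only one leg, which is why fusion helps with both exact identifiers and paraphrases.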

Distance metrics matter too. Azure AI Search compares query and document vectors with cosine similarity by default (the metric the text-embedding-3 models are normalized for); dot-product and Euclidean are also available, but the metric configured on the index must match what your embedding model expects, or relevance collapses. This mismatch is a subtle but exam-worthy failure mode alongside the dimension mismatch noted earlier.
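A short numeric check shows why the metric and the embedding model have to agree. For unit-length vectors, cosine similarity equals the dot product, and squared Euclidean distance is 2 - 2 * cosine, so all three give the same ordering; for unnormalized vectors the dot product diverges from cosine. The helper names and sample vectors below are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
a, b = normalize(raw_a), normalize(raw_b)

unit_dot = dot(a, b)        # matches cosine on unit vectors
unit_cos = cosine(a, b)
raw_dot = dot(raw_a, raw_b) # depends on vector length, unlike cosine
```

If vectors are not normalized, a dot-product index rewards long vectors regardless of direction, which is one way a mismatched metric can degrade relevance.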

Tuning retrieval quality

Lever	Effect	Trade-off
Larger `k_nearest_neighbors`	More candidate chunks recalled	More tokens injected, higher cost
Smaller chunks + overlap	Sharper passage targeting	More vectors to store and search
Add semantic ranker	Better top-result ordering	Slight added latency and cost
Raise relevance threshold	Fewer off-topic chunks	Risk of dropping a needed passage

The practical RAG failure to remember: if answers are vague or wrong, the culprit is usually retrieval (bad chunks reaching the prompt), not the LLM. Diagnose by logging the retrieved context before blaming the model, then adjust chunking, search type, or top count.
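Logging the retrieved context and applying a relevance threshold can be combined in one small step before the prompt is built. This is a minimal sketch with hypothetical names: it assumes each result is a plain dict with "id", "score", and "content" keys, whereas Azure AI Search results expose the score under "@search.score", so the key would need adapting.

```python
def filter_and_log(results, min_score=0.0, log=print):
    """Log every retrieved chunk, then keep those at or above min_score.

    Assumption: each result is a dict with "id", "score" and "content" keys.
    """
    kept = []
    for r in results:
        status = "keep" if r["score"] >= min_score else "drop"
        # Log a short preview so bad chunks are visible before blaming the LLM
        log(f"{status} {r['id']} score={r['score']:.2f} :: {r['content'][:60]}")
        if status == "keep":
            kept.append(r)
    return kept

sample = [
    {"id": "doc1-2", "score": 0.82, "content": "Regularization reduces overfitting."},
    {"id": "doc1-9", "score": 0.31, "content": "Office opening hours and parking."},
]
lines = []
kept = filter_and_log(sample, min_score=0.5, log=lines.append)
```

Reading the log makes it clear whether a poor answer came from off-topic chunks reaching the prompt or from the model itself; raising `min_score` trades fewer off-topic chunks against the risk of dropping a needed passage.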

On the Exam: Vector-only search misses exact identifiers (part numbers, names); keyword-only misses paraphrases. Hybrid (BM25 + vector) with semantic ranking is the recommended answer when a question asks for the best retrieval quality. Azure AI Search uses an HNSW graph for approximate nearest-neighbor vector search by default, and cosine similarity as the default distance metric for text-embedding-3 vectors.

Test Your Knowledge

1. How many dimensions does a text-embedding-3-large vector contain by default?
A. 768
B. 1,536
C. 3,072
D. 8,191

2. Which retrieval approach does Microsoft recommend for the highest-quality RAG results?
A. Keyword (BM25) search only
B. Vector search only
C. Hybrid search (keyword + vector) with semantic ranking
D. Exact-match filtering only

3. Why must large documents be split into chunks before generating embeddings?
A. Only because it makes the code shorter
B. Embedding models reject input over their token limit, and smaller chunks give more precise retrieval
C. Because Azure AI Search cannot store more than 256 documents
D. Chunking is optional and has no measurable effect

4. In a RAG system prompt, instructing the model to reply 'I don't have enough information' when context lacks the answer primarily serves to:
A. Shorten responses to save tokens
B. Prevent hallucination by keeping answers grounded in the supplied context
C. Comply with Azure billing terms
D. Force the model to switch languages
