6.3 Embeddings and Vector Search for RAG
Key Takeaways
- Embeddings map text to dense vectors so that semantically similar text yields vectors that are close (high cosine similarity).
- text-embedding-3-large = 3,072 dims, text-embedding-3-small = 1,536 dims, text-embedding-ada-002 = 1,536 dims (legacy); all accept up to ~8,191 tokens.
- RAG flow: embed the query, run vector/hybrid search in Azure AI Search, inject top chunks as context, then have GPT generate a grounded, cited answer.
- Chunking (256-1024 tokens with overlap) is mandatory both because of the embedding token limit and because smaller chunks give sharper retrieval.
- Hybrid search (BM25 keyword + vector) merged with Reciprocal Rank Fusion and re-scored by semantic ranking is Microsoft's recommended retrieval design.
Quick Answer: Embeddings turn text into vectors whose cosine similarity measures meaning. RAG = embed query → vector/hybrid search in Azure AI Search → feed top chunks to GPT → generate a grounded answer with citations. Chunk documents (256-1024 tokens, with overlap) before embedding, and prefer hybrid search + semantic ranking.
Embedding Models and Dimensions
| Model | Dimensions | Max tokens | Notes |
|---|---|---|---|
| text-embedding-3-large | 3,072 | ~8,191 | Highest quality; supports dimensions shortening |
| text-embedding-3-small | 1,536 | ~8,191 | Cheaper, strong quality |
| text-embedding-ada-002 | 1,536 | ~8,191 | Legacy; still common |
# client is assumed to be an Azure OpenAI client created elsewhere
resp = client.embeddings.create(
    model="embed-large",  # deployment name
    input="What is machine learning?"
)
vec = resp.data[0].embedding  # len == 3072 for 3-large
On the Exam: The 3-series supports a dimensions parameter to truncate vectors (e.g. 3-large down to 1,024), trading a little accuracy for cheaper storage and faster search. The index vector field dimension must equal the embedding length you store; a mismatch is a classic failure scenario.
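To make the shortening idea concrete, here is a minimal local sketch of what truncating an embedding amounts to: keep the leading components, then rescale to unit length so cosine comparisons stay meaningful. The helper name `shorten_embedding` and the toy 3-dimensional vector are illustrative assumptions; in practice you would normally request the shorter vector through the dimensions parameter rather than truncate by hand.

```python
import math

def shorten_embedding(vec, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]

# Toy stand-in for a real embedding (real vectors have 1,536 or 3,072 values).
full = [3.0, 4.0, 12.0]
short = shorten_embedding(full, 2)
```

Whatever length you end up storing is the number the index vector field must declare.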
Why Chunk Before Embedding
Two hard reasons:
- Token ceiling — an embedding call rejects input over ~8,191 tokens, so long docs must be split.
- Retrieval precision — a 30-page PDF in one vector returns the whole document for any vaguely related query; small chunks let search surface only the relevant passage.
| Strategy | Pro | Con |
|---|---|---|
| Fixed-size (N tokens) | Predictable | Cuts mid-sentence |
| Sentence/paragraph | Preserves meaning | Uneven sizes |
| Semantic (topic shift) | Best relevance | Most complex |
| Sliding window (overlap) | Keeps cross-chunk context | Redundant storage |
| Chunk size | Use |
|---|---|
| 256-512 tokens | FAQ, fine-grained Q&A |
| 512-1024 tokens | General RAG (default) |
| 1024-2048 tokens | Long, context-heavy topics |
Add 10-20% overlap so a fact spanning a boundary survives in at least one chunk.
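The sliding-window strategy with overlap can be sketched as follows. This is a simplified illustration, not a production splitter: it treats whitespace-separated words as stand-ins for model tokens (an assumption; a real pipeline would count tokens with the model's tokenizer), and the `chunk_text` name and its `size`/`overlap` parameters are chosen only for illustration.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()  # whitespace words approximate model tokens
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_text(sample, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk repeats the last word of the previous one, which is how a fact that straddles a boundary survives intact in at least one chunk.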
How Similarity Search Works
Once text is embedded, finding relevant content is a nearest-neighbor problem in vector space. In an exhaustive search, the query vector is compared against every stored chunk vector, and the closest ones by cosine similarity are returned. Because exact comparison across millions of vectors is expensive, Azure AI Search builds an approximate nearest neighbor (ANN) index using the HNSW (Hierarchical Navigable Small World) algorithm, trading a sliver of recall for dramatic speed. You configure the vector field with its dimension count, the distance metric, and HNSW parameters when you create the index.
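For intuition, the exhaustive version of this search fits in a few lines. The sketch below is a brute-force cosine comparison over a tiny in-memory store; it is not how Azure AI Search works internally (HNSW avoids scanning every vector), and the `top_k` helper and two-dimensional toy vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Score every stored vector against the query and return the k closest ids."""
    scored = [(cosine_similarity(query, v), doc_id) for doc_id, v in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

store = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], store, k=2)
```

The cost of this loop grows linearly with the number of stored vectors, which is the expense an ANN index is designed to avoid.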
A worked sizing example: indexing a 200-page manual at roughly 500 words (≈670 tokens) per chunk with 100-token overlap yields on the order of 150-200 chunks, hence 150-200 embedding calls and 150-200 vectors of 3,072 floats each to store. Doubling chunk size halves the vector count and storage but coarsens retrieval; halving it sharpens retrieval but multiplies storage and embedding cost. This is the central chunking trade-off the exam probes.
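The arithmetic behind that estimate can be checked with a short script. Several inputs here are assumptions rather than figures from the text: 400 words per page, 4 bytes per stored float (32-bit), and the `estimate_chunks` helper itself. The 670-tokens-per-500-words ratio, the 100-token overlap, and the 3,072 dimensions come from the example above.

```python
import math

def estimate_chunks(total_tokens, chunk_tokens, overlap_tokens):
    """Number of sliding windows needed to cover total_tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_tokens:
        return 1
    step = chunk_tokens - overlap_tokens
    return math.ceil((total_tokens - overlap_tokens) / step)

PAGES = 200
WORDS_PER_PAGE = 400   # assumption; varies widely by layout
DIMS = 3072            # text-embedding-3-large
BYTES_PER_FLOAT = 4    # assumption: 32-bit floats

total_tokens = PAGES * WORDS_PER_PAGE * 670 // 500  # 670 tokens per 500 words
chunks = estimate_chunks(total_tokens, chunk_tokens=670, overlap_tokens=100)
storage_bytes = chunks * DIMS * BYTES_PER_FLOAT

# Doubling both chunk size and overlap roughly halves the vector count.
chunks_doubled = estimate_chunks(total_tokens, chunk_tokens=1340, overlap_tokens=200)
```

Under these assumptions the manual needs 188 chunks and about 2.3 MB of raw vector data, and doubling the chunk size drops the count to 94.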
On the Exam: Remember that the index vector field dimension must equal the embedding length (3,072 for 3-large, 1,536 for 3-small/ada-002), and that the same embedding model must be used for both indexing and querying — mixing models or dimensions silently destroys relevance.
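One way to catch both mistakes early is a guard that runs before any vector is uploaded or used in a query. The sketch below is a hypothetical helper (`validate_embedding` and the `MODEL_DIMS` table are not part of any SDK); the default dimension counts are the ones listed in this section.

```python
MODEL_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

def validate_embedding(vec, model, index_model, index_dims):
    """Raise if the vector cannot safely be compared with the index contents."""
    if model != index_model:
        raise ValueError(f"index was built with {index_model}, got {model}")
    if len(vec) != index_dims:
        raise ValueError(f"index expects {index_dims} dims, got {len(vec)}")
    return True

# A 3-small vector checked against an index built for 3-small passes.
ok = validate_embedding(
    [0.0] * MODEL_DIMS["text-embedding-3-small"],
    model="text-embedding-3-small",
    index_model="text-embedding-3-small",
    index_dims=1536,
)
```

The model check matters even when dimensions agree: 3-small and ada-002 both produce 1,536 values, yet their vectors are not comparable.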
Building the Index and RAG Pipeline
from azure.search.documents.models import VectorizedQuery

# Assumes client (Azure OpenAI), search_client (SearchClient), doc, and a
# chunk_text helper are defined elsewhere.

# 1. Embed and index each chunk
for i, chunk in enumerate(chunk_text(doc, size=500, overlap=100)):
    emb = client.embeddings.create(model="embed-large", input=chunk).data[0].embedding
    search_client.upload_documents([{
        "id": f"doc1-{i}", "content": chunk, "contentVector": emb
    }])

# 2. Query time: embed, hybrid-search, then ground GPT
def rag(question):
    # Use the same embedding deployment as at indexing time
    qvec = client.embeddings.create(model="embed-large", input=question).data[0].embedding
    results = search_client.search(
        search_text=question,  # BM25 keyword leg
        vector_queries=[VectorizedQuery(vector=qvec, k_nearest_neighbors=5,
                                        fields="contentVector")],
        query_type="semantic",
        semantic_configuration_name="my-sem", top=5)
    # Prefix each chunk with its id so the model can cite it
    context = "\n\n".join(f"[{r['id']}] {r['content']}" for r in results)
    return client.chat.completions.create(
        model="gpt4o-chat", temperature=0.2,
        messages=[
            {"role": "system", "content":
                "Answer ONLY from the context. If it is not there, say you do not "
                "have enough information. Cite the [id] of each source used."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]).choices[0].message.content
Note the low temperature (0.2) and the explicit grounding instruction: together they reduce hallucination, though they do not eliminate it. The system prompt telling the model to admit ignorance is the single most exam-tested responsible-AI control in RAG.
How Hybrid Search Ranks
[Query]
├─ BM25 keyword search → exact-term hits
├─ Vector search (cosine) → semantic neighbors (k-NN, HNSW index)
└─ Reciprocal Rank Fusion → merge both result sets
└─ Semantic ranker → L2 re-rank with a language model
└─ Top N to the LLM
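The fusion step in the middle of that flow can be illustrated with a small sketch of Reciprocal Rank Fusion, in which each document earns 1 / (k + rank) from every list it appears in. The constant k=60 is a commonly cited default and an assumption here, not a confirmed service setting; the function name and document ids are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists; each appearance contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # BM25 order
vector_hits = ["doc1", "doc5", "doc3"]   # vector order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because only ranks are used, the two legs never need comparable raw scores, and a document that places well in both lists (doc1 here) rises above one that tops a single list.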
Distance metrics matter too. Azure AI Search compares query and document vectors with cosine similarity by default (the metric the text-embedding-3 models are normalized for); dot-product and Euclidean are also available, but the metric configured on the index must match what your embedding model expects, or relevance collapses. This mismatch is a subtle but exam-worthy failure mode alongside the dimension mismatch noted earlier.
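A toy comparison shows why the metric choice can change results. In the sketch below, the vectors and helper names are illustrative assumptions; the point is that dot product rewards vector length as well as direction, so it only agrees with cosine once vectors are scaled to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def rank(query, docs, metric):
    """Order document names from most to least similar under the given metric."""
    return sorted(docs, key=lambda name: metric(query, docs[name]), reverse=True)

query = [1.0, 0.0]
docs = {"aligned": [1.0, 0.0], "long": [3.0, 3.0]}

by_cosine = rank(query, docs, cosine)  # direction only
by_dot = rank(query, docs, dot)        # direction and magnitude
unit_docs = {name: unit(v) for name, v in docs.items()}
by_dot_unit = rank(query, unit_docs, dot)
```

On the raw vectors the two metrics disagree about the top result; after normalization, dot product reproduces the cosine ordering.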
Tuning retrieval quality
| Lever | Effect | Trade-off |
|---|---|---|
| Larger k_nearest_neighbors | More candidate chunks recalled | More tokens injected, higher cost |
| Smaller chunks + overlap | Sharper passage targeting | More vectors to store and search |
| Add semantic ranker | Better top-result ordering | Slight added latency and cost |
| Raise relevance threshold | Fewer off-topic chunks | Risk of dropping a needed passage |
The practical RAG failure to remember: if answers are vague or wrong, the culprit is usually retrieval (bad chunks reaching the prompt), not the LLM. Diagnose by logging the retrieved context before blaming the model, then adjust chunking, search type, or top count.
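A minimal way to follow that advice is to log exactly what reaches the prompt. The sketch below assumes each result is a plain dict with hypothetical `id`, `score`, and `content` keys, already sorted best-first (a real search response exposes its scores under different field names); `select_context` and the logger name are illustrative.

```python
import logging

logger = logging.getLogger("rag.retrieval")

def select_context(results, min_score=0.0, top=5):
    """Drop low-scoring chunks, log what remains, and build the prompt context."""
    kept = [r for r in results if r["score"] >= min_score][:top]
    for r in kept:
        # Log a short preview so retrieval can be inspected before blaming the LLM
        logger.info("retrieved id=%s score=%.3f text=%r",
                    r["id"], r["score"], r["content"][:80])
    return "\n\n".join(f"[{r['id']}] {r['content']}" for r in kept)

results = [
    {"id": "doc1-0", "score": 0.91, "content": "HNSW builds a layered graph."},
    {"id": "doc1-4", "score": 0.42, "content": "Unrelated release notes."},
]
context = select_context(results, min_score=0.5)
```

Raising `min_score` keeps off-topic chunks out of the prompt, at the risk noted in the table of dropping a passage the answer needed.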
On the Exam: Vector-only search misses exact identifiers (part numbers, names); keyword-only misses paraphrases. Hybrid (BM25 + vector) with semantic ranking is the recommended answer when a question asks for the best retrieval quality. Azure AI Search uses an HNSW graph for approximate nearest-neighbor vector search by default, and cosine similarity as the default distance metric for text-embedding-3 vectors.
How many dimensions does a text-embedding-3-large vector contain by default?
Which retrieval approach does Microsoft recommend for the highest-quality RAG results?
Why must large documents be split into chunks before generating embeddings?
In a RAG system prompt, instructing the model to reply 'I don't have enough information' when context lacks the answer primarily serves to: