6.3 Embeddings and Vector Search for RAG
Key Takeaways
- Embeddings convert text into dense numerical vectors that capture semantic meaning — similar texts produce similar vectors.
- Azure OpenAI embedding models: text-embedding-3-large (3,072 dimensions), text-embedding-3-small (1,536 dimensions), and text-embedding-ada-002 (1,536 dimensions, legacy).
- The RAG pattern: embed a user query → search for similar vectors in Azure AI Search → pass retrieved documents as context to GPT → generate a grounded response.
- Chunking strategies split large documents into smaller segments before embedding — chunk size affects retrieval precision and context relevance.
- Hybrid search combines keyword (BM25) and vector search, with optional semantic ranking, for the most comprehensive retrieval results.
Embeddings and Vector Search for RAG
Quick Answer: Embeddings convert text to vectors that capture meaning. The RAG pattern: embed query → vector search in AI Search → pass results as context to GPT → generate grounded response. Use hybrid search (keyword + vector + semantic ranking) for best results. Chunk documents before embedding.
Generating Embeddings
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-06-01",
    azure_endpoint="https://my-openai.openai.azure.com/"
)

# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-large-deployment",
    input="What is machine learning?"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 3072 for text-embedding-3-large
print(f"First 5 values: {embedding[:5]}")
```
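"Similar texts produce similar vectors" is usually quantified with cosine similarity. A minimal sketch in pure Python — the three-dimensional vectors below are toy illustrations, not real embeddings, which have 1,536 or 3,072 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only
ml = [0.9, 0.1, 0.2]        # "What is machine learning?"
ai = [0.85, 0.15, 0.25]     # "Explain artificial intelligence"
cooking = [0.1, 0.9, 0.3]   # "How do I bake bread?"

print(cosine_similarity(ml, ai))       # high: related topics
print(cosine_similarity(ml, cooking))  # lower: unrelated topics
```

A vector search engine performs this comparison (or an approximate-nearest-neighbor version of it) between the query vector and every indexed vector.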
Embedding Models Comparison
| Model | Dimensions | Max Tokens | Cost | Use Case |
|---|---|---|---|---|
| text-embedding-3-large | 3,072 | 8,191 | Higher | Best quality, large-scale RAG |
| text-embedding-3-small | 1,536 | 8,191 | Lower | Cost-effective, good quality |
| text-embedding-ada-002 | 1,536 | 8,191 | Medium | Legacy, still widely used |
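Dimension count also drives index size: Azure AI Search stores vectors as 32-bit floats (`Collection(Edm.Single)`), so each vector costs roughly dimensions × 4 bytes. A back-of-the-envelope sketch, assuming uncompressed float32 storage (quantization options can shrink this):

```python
def index_size_mb(num_chunks, dimensions, bytes_per_value=4):
    """Approximate raw vector storage: one float32 per dimension per chunk."""
    return num_chunks * dimensions * bytes_per_value / (1024 ** 2)

# 100,000 chunks at each model's dimensionality
print(round(index_size_mb(100_000, 3072), 1))  # text-embedding-3-large
print(round(index_size_mb(100_000, 1536), 1))  # text-embedding-3-small / ada-002
```

Halving the dimensions halves the vector storage, which is one reason text-embedding-3-small is attractive for cost-sensitive workloads.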
Chunking Strategies
Large documents must be split into chunks before embedding:
Strategies
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple, predictable | May cut mid-sentence |
| Sentence-based | Split on sentence boundaries | Preserves meaning | Uneven chunk sizes |
| Paragraph-based | Split on paragraph breaks | Natural boundaries | Very uneven sizes |
| Semantic | Use NLP to split on topic shifts | Best relevance | Most complex |
| Sliding window | Overlapping chunks | Preserves context | Redundant storage |
Chunk Size Recommendations
| Chunk Size | When to Use |
|---|---|
| 256-512 tokens | Fine-grained retrieval, FAQ-style Q&A |
| 512-1024 tokens | General-purpose RAG (most common) |
| 1024-2048 tokens | Complex topics requiring more context |
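The recommendations above are in tokens, while simple chunkers often count characters or words. A rough rule of thumb for English text is about 4 characters per token; the sketch below uses that heuristic (for exact counts, use a tokenizer library such as tiktoken with the model's encoding):

```python
def estimate_tokens(text):
    """Rough token estimate for English text (~4 characters per token)."""
    return max(1, len(text) // 4)

paragraph = "Azure AI Search supports keyword, vector, and hybrid retrieval."
print(estimate_tokens(paragraph))
```

This is only a sizing heuristic — actual token counts vary with vocabulary and language.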
Implementing Chunking
```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start = end - overlap  # Overlap with the previous chunk
    return chunks

# Chunk a document
chunks = chunk_text(document_text, chunk_size=500, overlap=100)

# Embed each chunk and index it in Azure AI Search
# (assumes a SearchClient named search_client is already configured)
for i, chunk in enumerate(chunks):
    embedding = client.embeddings.create(
        model="text-embedding-3-large-deployment",
        input=chunk
    ).data[0].embedding

    search_client.upload_documents([{
        "id": f"doc1-chunk{i}",
        "content": chunk,
        "contentVector": embedding
    }])
```
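The overlap behavior is easy to sanity-check without any Azure services. Using a synthetic 12-word document and toy sizes (restating the same word-based chunker so the snippet is self-contained), each chunk begins `overlap` words before the previous one ends:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step forward, keeping an overlap
    return chunks

# 12 numbered "words", chunks of 5 with an overlap of 2
doc = " ".join(str(i) for i in range(12))
for c in chunk_text(doc, chunk_size=5, overlap=2):
    print(c)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some redundant storage.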
End-to-End RAG Implementation
```python
from azure.search.documents.models import VectorizedQuery

def rag_query(user_question, search_client, openai_client):
    """Complete RAG pipeline: embed -> search -> generate."""
    # Step 1: Embed the user question
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-large-deployment",
        input=user_question
    ).data[0].embedding

    # Step 2: Hybrid search (keyword + vector + semantic ranking)
    results = search_client.search(
        search_text=user_question,  # Keyword (BM25) search
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=5,
                fields="contentVector"
            )
        ],
        query_type="semantic",
        semantic_configuration_name="my-semantic-config",
        top=5
    )

    # Step 3: Build context from the search results
    context = "\n\n".join(
        f"Source: {r['title']}\n{r['content']}"
        for r in results
    )

    # Step 4: Generate a grounded response
    response = openai_client.chat.completions.create(
        model="gpt4o-deployment",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions
based ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information to answer that question."
Always cite the source document when possible."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ],
        temperature=0.3  # Low temperature for factual responses
    )
    return response.choices[0].message.content
```
Hybrid Search Configuration
Hybrid search combines multiple search methods:
```
[User Query]
  ├── [Keyword Search (BM25)]   → results ranked by term frequency
  └── [Vector Search (cosine)]  → results ranked by semantic similarity
                 │
  [Reciprocal Rank Fusion]      → merge and re-rank the combined results
                 │
  [Semantic Ranking]            → final re-ranking by a language model
                 │
  [Top Results]                 → returned to the application
```
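Reciprocal Rank Fusion merges ranked lists by scoring each document as the sum of 1/(k + rank) over every list it appears in, so documents ranked well by both keyword and vector search rise to the top. A minimal sketch — k = 60 is the constant commonly used in the RRF literature; Azure AI Search's internal parameters may differ:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc3", "doc1", "doc7"]  # hypothetical BM25 ranking
vector_results = ["doc1", "doc5", "doc3"]   # hypothetical vector ranking
print(reciprocal_rank_fusion([keyword_results, vector_results]))
```

Here doc1 wins because it ranks highly in both lists, even though neither list placed it first and last.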
On the Exam: Hybrid search (keyword + vector) with semantic ranking is the recommended approach for RAG applications. Vector-only search may miss keyword matches; keyword-only search misses semantic matches. Hybrid with semantic ranking provides the best of both worlds.
On Your Data Feature
Azure OpenAI "On Your Data" provides a managed RAG experience:
- Connect your Azure AI Search index to Azure OpenAI
- Queries automatically retrieve relevant documents from the index
- Retrieved documents are injected into the prompt as context
- GPT generates a grounded response with citations
- No custom code required for basic RAG scenarios
```python
# Using Azure OpenAI "On Your Data"
response = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "user", "content": "What is our company's return policy?"}
    ],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://my-search.search.windows.net",
                "index_name": "company-docs",
                "authentication": {
                    "type": "api_key",
                    "key": "<search-key>"
                }
            }
        }]
    }
)
```
Review Questions
- What is the primary purpose of text embeddings in RAG?
- Which search approach provides the best results for RAG applications?
- Why is document chunking important before generating embeddings?
- In the RAG system prompt, why should you instruct the model to say "I don't have enough information" when the context doesn't contain an answer?