RAG, Grounding, Embeddings, and Azure AI Search
Key Takeaways
- Retrieval-augmented generation grounds responses in trusted content by retrieving relevant chunks at query time and passing them to the model as context.
- Chunking quality matters because the retriever matches chunks, not whole libraries; preserve headings, source IDs, permissions, and citation metadata.
- Embeddings turn text or multimodal content into vectors so Azure AI Search can retrieve semantic matches even when query words differ from document words.
- Hybrid search combines keyword and vector retrieval, while semantic ranker can rerank rich text results for better relevance in RAG workloads.
- Evaluate RAG at both stages: retrieval relevance for the chunks and groundedness, relevance, completeness, safety, and citation quality for the final answer.
Why RAG Is Tested Heavily
Retrieval-augmented generation, or RAG, is the standard pattern for answering with current or private knowledge that a base model was not trained on. The model still generates the final answer, but the factual material comes from retrieved grounding data. AI-103 questions often ask you to diagnose hallucinations, select Azure AI Search features, or decide whether RAG, fine-tuning, or an agent tool is the right answer.
Classic RAG Pipeline
| Stage | What happens | Common AI-103 trap |
|---|---|---|
| Ingest | Bring in documents, pages, records, images, or PDFs | Ignoring permissions and source metadata |
| Chunk | Split content into meaningful passages | Chunks too large, too small, or missing headings |
| Enrich | OCR, layout extraction, language analysis, cleanup | Treating scanned PDFs as plain text |
| Embed | Convert chunks into vectors at index time and queries into vectors at query time | Changing embedding model without regenerating vectors |
| Index | Store text, vectors, filters, fields, and citations | Forgetting filterable fields for security trimming |
| Retrieve | Run vector, keyword, hybrid, or agentic retrieval | Using only one method when recall is poor |
| Generate | Put selected chunks into the prompt and answer with citations | Letting the model answer beyond retrieved evidence |
| Evaluate | Score retrieval and response quality | Testing only final prose, not retrieved chunks |
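The Evaluate row separates retrieval scoring from answer scoring. As a minimal sketch of the retrieval half, the function below computes recall@k, a classic information-retrieval metric. It is not one of the Foundry RAG evaluators, and the chunk IDs and the labeled relevant set in the usage note are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the known-relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)
```

For example, with retrieved chunks `["c7", "c2", "c9", "c4"]` and relevant chunks `{"c2", "c4"}`, recall is 0.5 at k=3 and 1.0 at k=4. A low value here points to a retrieval problem even when the generated prose reads well.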
Chunking and Grounding
Good RAG starts before the first chat request. A policy handbook, API reference, or contract collection should be chunked around natural structure: headings, sections, tables, paragraphs, and page boundaries. Each chunk should carry document title, URL or blob path, page or section, last modified date, access-control metadata, and any citation fields the UI needs.
Chunk overlap can help when answers span section boundaries, but too much overlap increases cost and duplicate retrieval. Very large chunks waste context window space. Very small chunks lose the explanation needed to answer a question. The practical target is not a magic token number; it is the smallest chunk that still preserves enough meaning for an answer and citation.
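The trade-off between chunk size and overlap can be illustrated with a minimal word-window chunker. This is a sketch under stated assumptions: it splits on whitespace rather than tokens, the `Chunk` fields and the `max_words` and `overlap` defaults are hypothetical, and a real pipeline would also carry page, date, and access-control metadata.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str  # hypothetical citation field, e.g. a blob path
    heading: str    # section heading preserved with every chunk
    text: str


def chunk_section(source_id, heading, text, max_words=50, overlap=10):
    """Split one section into overlapping word windows that keep their metadata."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(Chunk(source_id, heading, " ".join(window)))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the section
    return chunks
```

Raising `overlap` repeats more words across neighboring chunks, which helps answers that straddle a boundary but increases the number of chunks to embed and store.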
Embeddings, Vector Search, Hybrid Search, and Semantic Ranker
An embedding model converts text or supported multimodal content into numeric vectors. Azure AI Search can store those vectors and perform nearest-neighbor retrieval for semantic similarity. That is why a query for "access badge replacement" can find a policy chunk titled "lost credential procedure" even without exact word overlap.
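Nearest-neighbor retrieval can be sketched with cosine similarity over a few vectors. The code below is a brute-force illustration only: it does not show how Azure AI Search implements vector search, and the three-dimensional vectors in the usage note are made up for the example, whereas real embedding models produce far longer vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, chunk_vecs, k=1):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]
```

With toy vectors such as `{"lost-credential-procedure": [0.9, 0.1, 0.0], "cafeteria-menu": [0.0, 0.2, 0.9]}` and a query vector `[0.8, 0.2, 0.1]`, `nearest` returns the credential chunk first, because the vectors point in similar directions even though the query and title share no words.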
Pure vector search is strong for concepts. Keyword search is strong for exact identifiers, product names, error codes, and legal terms. Hybrid search runs both in a single request and merges the ranked results with Reciprocal Rank Fusion (RRF), which is often the best default for enterprise RAG. Semantic ranker adds another query-time step that uses language understanding to rerank text-rich results. Use it when measured relevance improves enough to justify latency and cost.
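One common way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion, which scores each document by the sum of 1/(k + rank) across the lists it appears in. The sketch below illustrates that idea in plain Python; the constant `k=60` is the commonly cited default and is an assumption here, as are the document IDs in the usage note.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Given a keyword ranking `["plan-code-B12", "dental-faq", "lost-badge"]` and a vector ranking `["lost-badge", "plan-code-B12", "parking"]`, the documents found by both methods rise to the top, which is why hybrid retrieval helps when one method alone misses either exact codes or paraphrased wording.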
Azure AI Search also supports integrated vectorization, where indexers and skillsets can chunk and embed content during indexing instead of requiring custom ingestion code. For production, add security trimming so users retrieve only chunks they are allowed to see.
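Security trimming is typically expressed as an OData filter over a filterable collection field that lists who may read each chunk. The helper below is a minimal sketch that only builds the filter string using the `search.in` function; the field name `group_ids` and the group names in the usage note are assumptions, and the caller is assumed to have resolved the user's groups from a trusted identity source.

```python
def build_group_filter(group_ids, field="group_ids"):
    """Build an OData filter matching chunks shared with any of the caller's groups."""
    if not group_ids:
        # Assumption: a user with no groups should get no results, so do not query.
        raise ValueError("no groups supplied; return an empty result instead")
    for group in group_ids:
        if "," in group:
            raise ValueError("group ids must not contain the comma delimiter")
    # OData string literals escape a single quote by doubling it.
    joined = ",".join(group.replace("'", "''") for group in group_ids)
    return f"{field}/any(g: search.in(g, '{joined}', ','))"
```

For example, `build_group_filter(["hr-staff", "benefits-admins"])` yields `group_ids/any(g: search.in(g, 'hr-staff,benefits-admins', ','))`, which would be passed as the query's filter so that restricted chunks never reach the prompt.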
Citations and Refusal Behavior
Citations are not decorative. They let the user verify a generated claim against the source chunk. A grounded app should render citation metadata, avoid unsupported claims, and refuse or ask a follow-up question when retrieval does not provide enough evidence.
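The refusal rule can be sketched as a gate in front of prompt construction: if no retrieved chunk clears a relevance threshold, the app refuses instead of calling the model. Everything named here is hypothetical, including the `Retrieved` fields, the refusal wording, and the `min_score` value; score scales differ by retrieval method, so a real threshold would need to be tuned against evaluation data.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    source_id: str  # citation shown to the user, e.g. file and page
    text: str
    score: float    # relevance score from the retriever (scale is an assumption)


REFUSAL = "I could not find enough information in the indexed sources to answer that."


def build_grounded_prompt(question, results, min_score=0.5, max_chunks=3):
    """Return a citation-ready prompt, or None when the evidence is too weak."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    evidence = [r for r in ranked if r.score >= min_score][:max_chunks]
    if not evidence:
        return None  # caller should show REFUSAL or ask a follow-up question
    sources = "\n".join(
        f"[{i}] ({r.source_id}) {r.text}" for i, r in enumerate(evidence, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them like [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

A caller would send the returned prompt to the model, or display `REFUSAL` when the function returns `None`. Numbering the sources in the prompt gives the UI a direct mapping from each `[n]` marker back to its citation metadata.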
Official Anchors
- RAG in Azure AI Search: https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview
- Vector search overview: https://learn.microsoft.com/en-us/azure/search/vector-search-overview
- Hybrid search in Azure AI Search: https://learn.microsoft.com/en-us/azure/search/hybrid-search-how-to-query
- RAG evaluators in Foundry: https://learn.microsoft.com/en-us/azure/foundry/concepts/evaluation-evaluators/rag-evaluators
A benefits assistant answers from company PDFs. It misses queries that use everyday employee wording instead of policy terminology, and it also misses exact plan codes when employees provide them. What retrieval approach is the best first fix?