4.1 Building a RAG Pipeline

Key Takeaways

  • RAG runs three stages: retrieval (embed the query, search the Vector Search index), augmentation (inject chunks into a prompt), and generation (the LLM answers from that context).
  • Answer quality is bounded by retrieval quality: if the wrong chunks are retrieved, even a strong model produces an unsupported answer.
  • Ground the model by instructing it to use only the provided context and to abstain (say it does not know) when the answer is not supported.
  • When retrieved passages exceed the context budget, rerank and keep the top-k that fit rather than concatenating everything or truncating by position.
  • Preserve source metadata alongside chunks so the model can cite where each claim came from.
Last updated: July 2026

Why RAG Is the Default Pattern on Databricks

Retrieval-Augmented Generation (RAG) is the architecture you reach for when a generative AI application must answer from proprietary, frequently changing, or citable knowledge that the base model never saw during training. Instead of baking facts into model weights (which fine-tuning does), RAG keeps knowledge in an external store and fetches only the relevant slice at query time. On Databricks that store is Mosaic AI Vector Search, the source content lives in Delta tables under Unity Catalog governance, and the generator is a chat model served through the Foundation Model APIs or a custom Model Serving endpoint. Because Application Development is the heaviest domain at 30% of the exam, expect several scenarios that turn on assembling this flow correctly.

A textbook exam stem: an HR assistant must answer from policy documents that change every week and cite its sources. This is the canonical RAG trigger. Fine-tuning cannot absorb weekly updates and gives no source attribution; hardcoding answers into a system prompt goes stale the moment documents change. RAG scales because the knowledge lives outside the model and is fetched on demand.

The Three Stages: Retrieve, Augment, Generate

At query time the pipeline runs in three logical stages. The longer mnemonic Ingest, Chunk, Embed, Index, Retrieve, Rerank, Generate expands them by adding the offline indexing steps and the optional rerank:

Stage | What happens | Databricks component
Retrieval | Embed the query, run similarity search, return top-k chunks | Vector Search index + endpoint
Augmentation | Merge the question and retrieved chunks into a grounded prompt | Prompt template / prompt-building step
Generation | The LLM produces an answer conditioned on the injected context | ChatDatabricks / Foundation Model API

A principle the exam rewards repeatedly: answer quality is bounded by retrieval quality. If the wrong chunks come back, even the strongest generator produces a wrong or unsupported answer. Retrieval is therefore a first-class concern, not an afterthought, and most RAG failures are retrieval failures in disguise.
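
The three stages can be sketched in a few lines of plain Python. This is a toy illustration, not Databricks code: the in-memory CORPUS, the word-overlap scoring, and the echoing generate() function are all hypothetical stand-ins for a Vector Search index, embedding similarity, and a chat model. It does show why generation can only be as good as what retrieval hands it.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


# Toy in-memory corpus standing in for a Vector Search index (hypothetical data).
CORPUS = [
    Chunk("Employees accrue 1.5 vacation days per month.", "pto_policy.pdf"),
    Chunk("Expense reports are due within 30 days.", "expenses.pdf"),
]


def retrieve(question: str, k: int = 1) -> list[Chunk]:
    """Stage 1: rank chunks by naive word overlap (a stand-in for embedding similarity)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda c: len(q_words & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(question: str, chunks: list[Chunk]) -> str:
    """Stage 2: merge the question and the retrieved chunks into one prompt."""
    context = "\n".join(f"[{c.source}] {c.text}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


def generate(prompt: str) -> str:
    """Stage 3: placeholder for the LLM call; it only echoes the first context line."""
    return prompt.splitlines()[1]


def answer(question: str) -> str:
    return generate(augment(question, retrieve(question)))


print(answer("How many vacation days do employees accrue?"))
```

Because generate() sees nothing except the prompt, swapping in a retriever that returns the expenses chunk would produce an answer about expense reports, however capable the generator is.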

Connecting a Vector Search Retriever to an LLM

Retrieval starts with an index hosted on a Vector Search endpoint (the endpoint is the compute resource; it can host several indexes and is sized independently from the data). A Delta Sync index follows a source Delta table and refreshes automatically as rows change, so it is the default when the source is Delta; it requires Change Data Feed enabled on that table. A Direct Vector Access index skips the sync and requires you to supply and manage the vectors yourself, which only makes sense when embeddings come from outside Databricks.
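
As a small illustration of the Change Data Feed prerequisite, the helper below renders the table-property statement that turns the feed on. The function name and the table name main.hr.policy_chunks are hypothetical; delta.enableChangeDataFeed is the Delta table property involved. The sketch only builds the string and does not run it.

```python
def enable_cdf_statement(table_name: str) -> str:
    """Render the Delta table-property change that turns on Change Data Feed."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


# Hypothetical three-level Unity Catalog name, used only for illustration.
stmt = enable_cdf_statement("main.hr.policy_chunks")
print(stmt)
# In a Databricks notebook this string would typically be passed to spark.sql(stmt).
```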

The query must be embedded with the same embedding model used to build the index; mismatched dimensions or a different semantic space make similarity meaningless and can cause queries to fail. At query time the retriever returns the top-k most similar chunks. In application code the standard pattern is to wrap the index and expose a retrieval interface (.as_retriever()), which plugs cleanly into chains and agents. Optional refinements include metadata filters (restrict to a source, language, or date such as 2026 policies only) and hybrid search (blend semantic similarity with keyword matching so rare codes, IDs, and proper nouns are not missed).
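
To make the mechanics concrete, here is a minimal pure-Python sketch of top-k similarity search with a metadata filter and an explicit dimension check. It is not the Vector Search API: the row layout, the top_k function, and the exact-match filter semantics are assumptions chosen for illustration, and hybrid search is not shown.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; refuses vectors produced by models of different dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: query has {len(a)}, index has {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=2, filters=None):
    """Return the k most similar rows after applying exact-match metadata filters."""
    filters = filters or {}
    candidates = [
        r for r in rows
        if all(r["meta"].get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]


# Hypothetical index rows: an id, an embedding, and metadata.
ROWS = [
    {"id": "a", "vector": [1.0, 0.0, 0.0], "meta": {"year": 2026}},
    {"id": "b", "vector": [0.9, 0.1, 0.0], "meta": {"year": 2025}},
    {"id": "c", "vector": [0.0, 1.0, 0.0], "meta": {"year": 2026}},
]

hits = top_k([1.0, 0.0, 0.0], ROWS, k=2, filters={"year": 2026})
print([h["id"] for h in hits])
```

Row b is close to the query but is excluded by the year filter before ranking, which is the point of filtering on metadata. A two-dimensional query against these three-dimensional rows raises ValueError rather than returning meaningless scores.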

Augmentation: Injecting Context and Grounding the Model

Augmentation assembles the model-ready input. Two design choices dominate the questions. First, placement and delimiters: put the retrieved context close to the question, wrap it in clear delimiters, and instruct the model to ground its answer in that evidence. Passages dumped after the question with no delimiters, or shuffled randomly on each call, weaken the model's ability to use them. Second, grounding and abstention: tell the model to answer only from the supplied context and to say it does not know when the answer is unsupported. This single instruction is the cheapest defense against confident hallucination, which is costly in enterprise settings.
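
One possible shape for such a prompt is sketched below. The <context> delimiters, the wording of the instructions, and the ABSTAIN string are illustrative choices, not a prescribed Databricks template.

```python
# Hypothetical abstention phrase the model is told to return verbatim.
ABSTAIN = "I don't know based on the provided documents."


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Wrap passages in explicit delimiters and place them directly before the question."""
    context = "\n\n".join(passages)
    return (
        "Answer using only the text between <context> and </context>.\n"
        f'If the answer is not supported there, reply exactly: "{ABSTAIN}"\n\n'
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "How many vacation days do I accrue each month?",
    ["Employees accrue 1.5 vacation days per month."],
)
print(prompt)
```

Passages are joined in the order given, so the same retrieval result always yields the same prompt instead of a shuffled one.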

To support citations, carry the source metadata with each chunk and instruct the model to reference it; a confidence score alone never tells the user where a claim came from. To personalize, inject user-context fields such as account tier or product SKU directly into the prompt the model sees — storing them only in index metadata helps retrieval but never reaches generation.
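
A minimal sketch of both ideas follows, assuming each chunk is a dict with hypothetical text, source, and section keys and that user context arrives as a plain dict. The numbering scheme and field names are illustrative only.

```python
def format_chunk(index: int, chunk: dict) -> str:
    """Label a chunk with a citation number and its source metadata."""
    return f"[{index}] ({chunk['source']}, section {chunk['section']})\n{chunk['text']}"


def build_cited_prompt(question: str, chunks: list[dict], user_context: dict) -> str:
    """Put user-context fields and source-labeled chunks into the text the model sees."""
    sources = "\n\n".join(format_chunk(i, c) for i, c in enumerate(chunks, start=1))
    user_lines = "\n".join(f"{key}: {value}" for key, value in user_context.items())
    return (
        "Cite the bracketed source number after each claim, for example [1].\n\n"
        f"User context:\n{user_lines}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


cited_prompt = build_cited_prompt(
    "Can I carry over unused vacation days?",
    [{"text": "Up to 5 unused days carry over.", "source": "pto_policy.pdf", "section": "3.2"}],
    {"account_tier": "premium"},
)
print(cited_prompt)
```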

Handling No-Answer and Low-Relevance Cases

When top similarity scores are low, the honest behavior is to abstain rather than fabricate. Combine a similarity threshold or metadata filter with an explicit abstention instruction. A second, common constraint is a tight context budget: a retriever may return twenty relevant passages that together blow past the model's window. The correct move is to rerank the candidates, keep the top-k that fit, and drop or summarize the rest — never concatenate all twenty and hope, never keep only the first passage by position, and never drop the user's question to make room. Reranking spends the token budget on the best evidence and raises precision when only a few passages can fit.
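
The selection logic can be sketched as a threshold followed by a greedy fill of the budget. Everything here is an assumption for illustration: the score field stands for a reranker relevance score, the 0.5 threshold is arbitrary, and count_tokens is a crude whitespace count rather than a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would give different numbers."""
    return len(text.split())


def select_context(candidates: list[dict], budget: int, min_score: float = 0.5) -> list[dict]:
    """Keep the highest-scoring passages that fit the budget; an empty list means abstain."""
    relevant = [c for c in candidates if c["score"] >= min_score]
    ranked = sorted(relevant, key=lambda c: c["score"], reverse=True)
    kept, used = [], 0
    for c in ranked:
        cost = count_tokens(c["text"])
        # Skip a passage that does not fit, but keep trying smaller ones after it.
        if used + cost <= budget:
            kept.append(c)
            used += cost
    return kept


candidates = [
    {"id": "p1", "score": 0.91, "text": "one two three four"},
    {"id": "p2", "score": 0.40, "text": "low relevance"},
    {"id": "p3", "score": 0.85, "text": "five six seven eight nine ten"},
    {"id": "p4", "score": 0.70, "text": "eleven twelve"},
]

kept = select_context(candidates, budget=6)
print([c["id"] for c in kept])
```

If select_context returns an empty list, the application can return the abstention message without calling the model at all.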

End-to-End Example

The weekly HR assistant ties it together: ingest policy PDFs into a Delta table, chunk with overlap, embed each chunk, and build a Delta Sync index so weekly edits flow through automatically. At query time embed the question, retrieve the top candidates with a metadata filter for the current policy year, rerank down to what fits, assemble a delimited prompt that carries source titles and instructs grounding plus abstention, and send it to a ChatDatabricks model. The returned answer cites its policy sections, and if retrieval finds nothing relevant, the assistant says so instead of inventing a clause.
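
The ingestion side of this walkthrough mentions chunking with overlap. A minimal word-based sketch is shown below; the function name and the size and overlap parameters are hypothetical, and production pipelines often split on tokens or document structure instead of raw words.

```python
def chunk_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size` that share `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(words):
            break
    return chunks


words = "a b c d e f g".split()
for chunk in chunk_with_overlap(words, size=4, overlap=2):
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of some duplicated text in the index.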

Test Your Knowledge

A retriever returns 20 relevant passages, but together they exceed the model's context budget. What is the best next step?

An enterprise RAG assistant sometimes answers confidently about policies that do not exist in the indexed corpus. Which prompt-side change most directly reduces this?

In a Databricks RAG pipeline, which statement best captures the relationship between retrieval and final answer quality?
