A retriever returns 20 relevant passages, but together they exceed the model's context budget. What is the best next step?

Rerank the passages, keep the top-k that fit, and drop or summarize the rest. Reranking and keeping the highest-relevance passages that fit spends the limited token budget on the strongest evidence. Sending everything, removing the question, or changing temperature does not solve the context-window overflow and usually hurts answer quality.

An enterprise RAG assistant sometimes answers confidently about policies that do not exist in the indexed corpus. Which prompt-side change most directly reduces this?

Instruct the model to use only the provided context and to say it does not know when the answer is unsupported. A grounding-plus-abstention instruction tells the model to treat retrieved context as the source of truth and to decline when evidence is missing, which directly curbs confident hallucination. Higher temperature, longer answers, or removing context all make unsupported claims more likely, not less.

In a Databricks RAG pipeline, which statement best captures the relationship between retrieval and final answer quality?

Answer quality is bounded by retrieval quality, so poor retrieval cannot be rescued by a stronger generator. The model can only reason over the chunks it is given; if retrieval returns the wrong evidence, even the best generator produces a wrong or unsupported answer. Retrieval is therefore a first-class concern rather than something a larger model can compensate for.

Building a RAG Pipeline — Free Study Guide 2026

Why RAG Is the Default Pattern on Databricks

Retrieval-Augmented Generation (RAG) is the architecture you reach for when a generative AI application must answer from proprietary, frequently changing, or citable knowledge that the base model never saw during training. Instead of baking facts into model weights (which fine-tuning does), RAG keeps knowledge in an external store and fetches only the relevant slice at query time. On Databricks that store is Mosaic AI Vector Search, the source content lives in Delta tables under Unity Catalog governance, and the generator is a chat model served through the Foundation Model APIs or a custom Model Serving endpoint. Because Application Development is the heaviest domain at 30% of the exam, expect several scenarios that turn on assembling this flow correctly.

A textbook exam stem: an HR assistant must answer from policy documents that change every week and cite its sources. This is the canonical RAG trigger. Fine-tuning cannot absorb weekly updates and gives no source attribution; hardcoding answers into a system prompt goes stale the moment documents change. RAG scales because the knowledge lives outside the model and is fetched on demand.

The Three Stages: Retrieve, Augment, Generate

The pipeline runs in three logical stages, a condensed view of the fuller mnemonic Ingest, Chunk, Embed, Index, Retrieve, Rerank, Generate:

| Stage | What happens | Databricks component |
| --- | --- | --- |
| Retrieval | Embed the query, run similarity search, return top-k chunks | Vector Search index + endpoint |
| Augmentation | Merge the question and retrieved chunks into a grounded prompt | Prompt template / prompt-building step |
| Generation | The LLM produces an answer conditioned on the injected context | `ChatDatabricks` / Foundation Model API |

A principle the exam rewards repeatedly: answer quality is bounded by retrieval quality. If the wrong chunks come back, even the strongest generator produces a wrong or unsupported answer. Retrieval is therefore a first-class concern, not an afterthought, and most RAG failures are retrieval failures in disguise.
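The three stages can be sketched as plain functions. This is a toy illustration rather than Databricks code: `retrieve`, `augment`, `answer`, and the word-overlap scoring are hypothetical stand-ins for a Vector Search query, a prompt template, and a chat model call.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retrieval: rank chunks by how many query words they share."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, chunks: List[str]) -> str:
    """Merge the question and the retrieved chunks into one prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


def answer(query: str, corpus: List[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve, augment, then hand the prompt to any generator callable."""
    return generate(augment(query, retrieve(query, corpus, k)))
```

Because `generate` only ever sees what `retrieve` returned, swapping in a stronger generator cannot recover evidence that retrieval missed.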

Connecting a Vector Search Retriever to an LLM

Retrieval starts with an index hosted on a Vector Search endpoint (the endpoint is the compute resource; it can host several indexes and is sized independently from the data). A Delta Sync index follows a source Delta table and refreshes automatically as rows change, so it is the default when the source is Delta; it requires Change Data Feed enabled on that table. A Direct Vector Access index skips the sync and requires you to supply and manage the vectors yourself, which only makes sense when embeddings come from outside Databricks.
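A minimal sketch of setting up a Delta Sync index, assuming the `databricks-vectorsearch` Python package and its `VectorSearchClient.create_delta_sync_index` call. The helper names, the table, endpoint, and index names passed in, and the choice of a triggered pipeline are all illustrative assumptions, not values from this guide.

```python
from typing import Dict


def enable_cdf_sql(table: str) -> str:
    """SQL that turns on Change Data Feed, which Delta Sync requires."""
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def delta_sync_index_args(endpoint_name: str, source_table: str,
                          index_name: str, primary_key: str,
                          text_column: str,
                          embedding_endpoint: str) -> Dict[str, str]:
    """Collect the keyword arguments for a Delta Sync index (hypothetical helper)."""
    return {
        "endpoint_name": endpoint_name,          # compute that hosts the index
        "source_table_name": source_table,       # Delta table to follow
        "index_name": index_name,
        "pipeline_type": "TRIGGERED",            # assumption: sync on demand
        "primary_key": primary_key,
        "embedding_source_column": text_column,  # text Databricks embeds for you
        "embedding_model_endpoint_name": embedding_endpoint,
    }


def create_index(args: Dict[str, str]):
    """Create the index; needs a Databricks workspace and credentials."""
    # Imported lazily so the helpers above work without the package installed.
    from databricks.vector_search.client import VectorSearchClient
    return VectorSearchClient().create_delta_sync_index(**args)
```

A Direct Vector Access index would skip `source_table_name` and the sync pipeline entirely, leaving you to upsert vectors yourself.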

The query must be embedded with the same embedding model used to build the index; mismatched dimensions or a different semantic space make similarity meaningless and can cause queries to fail. At query time the retriever returns the top-k most similar chunks. In application code the standard pattern is to wrap the index in a vector store object and expose a retrieval interface (`.as_retriever()`), which plugs cleanly into chains and agents. Optional refinements include metadata filters (restrict results to a source, language, or date range, for example only 2026 policies) and hybrid search (blend semantic similarity with keyword matching so rare codes, IDs, and proper nouns are not missed).
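The mechanics of top-k retrieval with a metadata filter can be illustrated in pure Python. This is a sketch under stated assumptions: the chunk dictionary shape, the `cosine` and `top_k` helpers, and the tiny two-dimensional vectors are hypothetical and stand in for what a Vector Search index does internally.

```python
import math
from typing import Any, Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; refuses vectors from different embedding spaces."""
    if len(a) != len(b):
        raise ValueError("query and index vectors have different dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], chunks: List[Dict[str, Any]], k: int = 2,
          filters: Optional[Dict[str, Any]] = None) -> List[Tuple[float, str]]:
    """Apply metadata filters first, then return the k most similar chunks."""
    wanted = filters or {}
    candidates = [
        c for c in chunks
        if all(c.get(field) == value for field, value in wanted.items())
    ]
    scored = [(cosine(query_vec, c["vector"]), c["text"]) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Filtering before scoring means an out-of-scope chunk can never crowd a relevant one out of the top-k, however similar its vector is.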

Augmentation: Injecting Context and Grounding the Model

Augmentation assembles the model-ready input. Two design choices dominate the questions. First, placement and delimiters: put the retrieved context close to the question, wrap it in clear delimiters, and instruct the model to ground its answer in that evidence. Passages dumped after the question with no delimiters, or shuffled randomly on each call, weaken the model's ability to use them. Second, grounding and abstention: tell the model to answer only from the supplied context and to say it does not know when the answer is unsupported. This single instruction is the cheapest defense against confident hallucination, which is costly in enterprise settings.
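One way to express these two choices in code is a small prompt builder. The tag names, the `ABSTAIN` wording, and the function name below are illustrative assumptions; any clear, consistent delimiter scheme serves the same purpose.

```python
from typing import List

# Hypothetical abstention phrase the model is told to use verbatim.
ABSTAIN = "I don't know based on the provided documents."


def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Wrap passages in delimiters, keep their order, and place them next to the question."""
    context = "\n\n".join(
        f"<passage id={i}>\n{p}\n</passage>"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the passages inside <context>. "
        f'If they do not contain the answer, reply exactly: "{ABSTAIN}"\n\n'
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
```

The passages keep the order they arrive in, so the same retrieval result always yields the same prompt.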

To support citations, carry the source metadata with each chunk and instruct the model to reference it; a confidence score alone never tells the user where a claim came from. To personalize, inject user-context fields such as account tier or product SKU directly into the prompt the model sees — storing them only in index metadata helps retrieval but never reaches generation.
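A short sketch of carrying source metadata and user context into the prompt. The field names (`source`, `text`, `account_tier`) and both helper functions are hypothetical; the point is only that whatever the model should cite or personalize on must appear in the text it receives.

```python
from typing import Dict, List


def format_chunks_with_sources(chunks: List[Dict[str, str]]) -> str:
    """Prefix every chunk with its source so the model can cite it."""
    return "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)


def build_cited_prompt(question: str, chunks: List[Dict[str, str]],
                       user_context: Dict[str, str]) -> str:
    """Put user fields and source-tagged chunks directly in the prompt."""
    user_lines = "\n".join(f"{k}: {v}" for k, v in sorted(user_context.items()))
    return (
        "Use only the context below and cite the bracketed source "
        "after each claim.\n\n"
        f"User profile:\n{user_lines}\n\n"
        f"Context:\n{format_chunks_with_sources(chunks)}\n\n"
        f"Question: {question}"
    )
```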

Handling No-Answer and Low-Relevance Cases

When top similarity scores are low, the honest behavior is to abstain rather than fabricate. Combine a similarity threshold or metadata filter with an explicit abstention instruction. A second common constraint is a tight context budget: a retriever may return twenty relevant passages that together blow past the model's window. The correct move is to rerank the candidates, keep the top-k that fit, and drop or summarize the rest. Never concatenate all twenty and hope, never keep only the first passage by position, and never drop the user's question to make room. Reranking spends the token budget on the best evidence and raises precision when only a few passages can fit.
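The threshold-then-budget logic can be sketched as follows. Assumptions to note: the scores are taken to come from a reranker, `count_tokens` is a crude whitespace count rather than a real tokenizer, and the function names are hypothetical.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would differ."""
    return len(text.split())


def select_passages(scored: List[Tuple[float, str]], budget: int,
                    min_score: float) -> List[str]:
    """Drop low-relevance passages, then keep the best ones that fit the budget."""
    relevant = [pair for pair in scored if pair[0] >= min_score]
    ranked = sorted(relevant, key=lambda pair: pair[0], reverse=True)
    kept: List[str] = []
    used = 0
    for _, text in ranked:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first passage that no longer fits
        kept.append(text)
        used += cost
    return kept
```

An empty result is the signal to abstain: nothing cleared the relevance threshold, so the application should say it does not know instead of calling the model with weak evidence.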

End-to-End Example

The weekly HR assistant ties it together: ingest policy PDFs into a Delta table, chunk with overlap, embed each chunk, and build a Delta Sync index so weekly edits flow through automatically. At query time embed the question, retrieve the top candidates with a metadata filter for the current policy year, rerank down to what fits, assemble a delimited prompt that carries source titles and instructs grounding plus abstention, and send it to a ChatDatabricks model. The returned answer cites its policy sections, and if retrieval finds nothing relevant, the assistant says so instead of inventing a clause.
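The whole flow can be mimicked end to end in a few dozen lines. This is a toy sketch, not the Databricks implementation: a bag-of-words `Counter` stands in for the embedding model, an in-memory list stands in for the Delta Sync index, `generate` is any callable standing in for the chat model, reranking is omitted for brevity, and every name is hypothetical.

```python
import math
from collections import Counter
from typing import Any, Callable, Dict, List

ABSTAIN = "I don't know based on the indexed policies."


def chunk_words(text: str, size: int, overlap: int) -> List[str]:
    """Split text into word windows of `size` that share `overlap` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real pipeline calls an embedding model."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(docs: List[Dict[str, Any]], size: int = 12,
                overlap: int = 3) -> List[Dict[str, Any]]:
    """Chunk and embed each document, keeping title and year as metadata."""
    return [
        {"text": piece, "title": d["title"], "year": d["year"],
         "vector": embed(piece)}
        for d in docs
        for piece in chunk_words(d["text"], size, overlap)
    ]


def answer(question: str, index: List[Dict[str, Any]],
           generate: Callable[[str], str], year: int,
           k: int = 2, min_score: float = 0.2) -> str:
    """Filter by year, retrieve top-k, abstain on weak evidence, else generate."""
    q = embed(question)
    hits = sorted(
        ((cosine(q, c["vector"]), c) for c in index if c["year"] == year),
        key=lambda pair: pair[0], reverse=True,
    )[:k]
    hits = [(score, c) for score, c in hits if score >= min_score]
    if not hits:
        return ABSTAIN
    context = "\n".join(f"[{c['title']}] {c['text']}" for _, c in hits)
    prompt = (
        "Answer only from the context and cite the bracketed titles.\n\n"
        f"<context>\n{context}\n</context>\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Passing `generate=lambda prompt: prompt` is a convenient way to inspect exactly what the model would receive before wiring in a real endpoint.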

Databricks Generative AI Engineer Associate Certification

Databricks Generative AI Engineer Associate

4.1 Building a RAG Pipeline

