1.3 RAG & agent architectures testers must understand
Key Takeaways
- RAG combines a retriever, a vector store of embedded chunks and a generator to ground answers in trusted sources and reduce hallucination.
- Testers must validate retrieval relevance separately from generation faithfulness, because the fixes for a retrieval failure and a generation failure differ.
- Chunk size and overlap are test variables: oversized chunks dilute relevance while undersized chunks split a single fact across separate retrievable units.
- Agents add tools and function-calling, memory and a planning loop, so a single wrong step or tool error can derail an entire multi-step task.
- Non-determinism compounds across an agent's steps, making repeated-trial testing, step and budget limits, and guardrails on actions essential.
RAG and agents: architectures testers must validate
Two architectures dominate real GenAI products: retrieval-augmented generation (RAG), which grounds answers in external data, and agents, which let a model take actions. Each introduces distinct components and therefore distinct test targets, and each moves risk out of the model's weights and into the surrounding system.
RAG: grounding to reduce hallucination
A RAG system supplements the model's built-in knowledge with retrieved documents so answers are grounded in a trusted source rather than invented. The pipeline has three core parts:
- Retriever — embeds the user query and searches a vector store for the most similar chunks, typically by cosine similarity.
- Vector store / knowledge base — documents are split into chunks, embedded and indexed ahead of time.
- Generator — the LLM receives the query plus the retrieved chunks as context and composes the final answer.
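The three parts above can be sketched end to end in a few lines. This is a toy illustration, not a production design: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the list `CHUNKS` plays the role of the vector store, and `build_prompt` only assembles the text a generator would receive (no LLM is called).

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retriever: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Generator input: the query plus the retrieved chunks as context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Knowledge base: chunks prepared ahead of time (hypothetical content).
CHUNKS = [
    "refunds are issued within 14 days of purchase",
    "shipping takes 3 to 5 business days",
    "support is available by email on weekdays",
]

top = retrieve("how many days for refunds", CHUNKS, k=1)
prompt = build_prompt("how many days for refunds", top)
```

Even in this toy form, each function is a separate test target: `retrieve` can be checked without ever calling a model.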
RAG reduces hallucination and lets a system cite current, domain-specific facts without retraining, but it introduces new, layered failure points that testers own.
- Retrieval relevance. Did the retriever return chunks that actually contain the answer? Poor retrieval starves the generator, and no amount of prompt tuning fixes it. Test with known query-document pairs and measure precision and recall over the top-k retrieved chunks.
- Grounding / faithfulness. Does the generated answer stay faithful to the retrieved context, or does it add unsupported claims? A faithfulness check verifies that every claim traces back to a source; this is the core RAG oracle.
- Chunking. Chunk size and overlap decide whether a coherent answer even exists in a single retrievable unit. Too-large chunks dilute relevance; too-small chunks split facts apart. Chunk strategy is a genuine test variable.
- Freshness and coverage. If the knowledge base is stale or incomplete, answers are wrong despite a correct pipeline — so test the currency of the data, not only the code.
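Retrieval relevance from the list above can be measured with precision and recall over the top-k results. A minimal sketch, assuming each test query comes with a hand-labeled set of relevant chunk IDs; the IDs below are made up for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query.

    retrieved_ids: ranked list of chunk IDs returned by the retriever.
    relevant_ids:  set of chunk IDs a human labeled as containing the answer.
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical labeled example: one query, four ranked results.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

p3, r3 = precision_recall_at_k(retrieved, relevant, k=3)  # one hit in the top 3
p4, r4 = precision_recall_at_k(retrieved, relevant, k=4)  # two hits in the top 4
```

Averaging these scores over a fixed set of labeled queries gives a regression metric that can be tracked whenever the chunking, embedding model or index changes.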
The RAG pipeline as a test map
| Stage | Component | What testers validate |
|---|---|---|
| Ingest | Chunker and embedder | Chunk size and overlap; embedding quality |
| Store | Vector database | Index correctness; freshness and coverage |
| Retrieve | Retriever | Relevance (precision/recall) of returned chunks |
| Generate | LLM | Faithfulness and grounding; citation accuracy |
| Answer | Final output | Correctness, safety and required format |
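The Ingest row can be exercised with a boundary-style test on chunk size and overlap. The sketch below assumes a simple word-based chunker (`chunk_words` is a hypothetical helper; real pipelines often split on tokens or sentences) and checks whether a known fact survives intact inside at least one chunk.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, sharing `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def fact_is_retrievable(chunks, fact):
    """True if the whole fact appears inside a single chunk."""
    return any(fact in chunk for chunk in chunks)

DOC = "Our policy is simple. Refunds are issued within 14 days. Contact support for help."
FACT = "Refunds are issued within 14 days"

small = chunk_words(DOC, size=4, overlap=0)   # the fact is split across two chunks
larger = chunk_words(DOC, size=8, overlap=4)  # one chunk holds the whole fact

split_apart = not fact_is_retrievable(small, FACT)
kept_whole = fact_is_retrievable(larger, FACT)
```

Running the same fact list against several size and overlap settings turns chunk strategy into a parameterized test rather than a one-off configuration choice.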
A useful mental model: RAG failures are either retrieval failures (wrong or missing context) or generation failures (the context was right but the model ignored or contradicted it). Localizing which one occurred is a core testing skill because the fixes differ: a retrieval failure is addressed in the data, chunking or retriever, while a generation failure is addressed in the prompt or the model.
A common set of measurable RAG qualities is context relevance (are the retrieved chunks on-topic?), faithfulness (is the answer supported by them?) and answer relevance (does the answer address the question?). Because manual grading does not scale, teams often use an LLM-as-a-judge to score these. The tester must in turn validate that the judge itself is calibrated and unbiased, for example by comparing its scores against a human-labeled sample, rather than trusting it blindly.
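Both ideas in this passage can be expressed as small checks. The sketch below is illustrative only: `localize_failure` assumes each test case records which chunk holds the answer (`gold_chunk_id`) and that support and correctness have already been graded, and `judge_agreement` assumes a small human-labeled sample is available to compare against the judge. All names are hypothetical.

```python
def localize_failure(gold_chunk_id, retrieved_ids, answer_supported, answer_correct):
    """Classify one RAG test case as a retrieval failure, a generation failure or a pass.

    Retrieval is checked first: if the chunk holding the answer was never
    retrieved, the case counts as a retrieval failure even when the model
    happened to answer correctly from its own weights.
    """
    if gold_chunk_id not in retrieved_ids:
        return "retrieval failure"
    if not (answer_supported and answer_correct):
        return "generation failure"
    return "pass"

def judge_agreement(judge_labels, human_labels):
    """Fraction of cases where the LLM judge and the human grader agree."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical cases: the right chunk was retrieved but the answer adds an unsupported claim...
case_a = localize_failure("c2", ["c2", "c9"], answer_supported=False, answer_correct=False)
# ...versus the right chunk never reaching the generator at all.
case_b = localize_failure("c2", ["c7", "c9"], answer_supported=False, answer_correct=False)

# Judge faithfulness verdicts compared with four human verdicts.
agreement = judge_agreement([True, False, True, True], [True, False, False, True])
```

A low agreement rate is a signal to revise the judge prompt or rubric before relying on its scores.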
Agents: LLMs that plan and act
An agent extends an LLM with the ability to decide and act. Its typical components are:
- Tools / function-calling — the model can invoke external functions or APIs such as search, calculators, databases or code execution, and then use the results.
- Memory — short-term (the conversation) and long-term (persisted facts) state that informs later steps.
- Planning loop — the agent reasons, chooses an action, observes the result and repeats until it reaches a goal, an iterative reason-act cycle.
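The components above can be sketched as a small loop. This is a simplified illustration under stated assumptions: `scripted_model` is a deterministic stand-in for an LLM, the action dictionaries are an invented format rather than any vendor's function-calling schema, and `history` serves as short-term memory.

```python
def scripted_model(history):
    """Hypothetical stand-in for an LLM: picks the next action from the history so far."""
    if not any(step["type"] == "observation" for step in history):
        return {"type": "tool_call", "tool": "add", "args": {"a": 2, "b": 3}}
    last = history[-1]
    return {"type": "final", "answer": f"The sum is {last['result']}"}

# Tool registry: plain functions the agent is allowed to invoke.
TOOLS = {"add": lambda a, b: a + b}

def run_agent(task, model, tools, max_steps=5):
    """Reason-act loop: ask the model for an action, run it, record the observation, repeat."""
    history = [{"type": "task", "content": task}]  # short-term memory
    for _ in range(max_steps):
        action = model(history)
        history.append(action)
        if action["type"] == "final":
            return action["answer"], history
        result = tools[action["tool"]](**action["args"])
        history.append({"type": "observation", "result": result})
    raise RuntimeError("step limit reached without a final answer")

answer, trace = run_agent("What is 2 + 3?", scripted_model, TOOLS)
step_types = [step["type"] for step in trace]
```

Returning the full trace alongside the answer lets a test assert on intermediate steps, such as which tool was called and with which arguments, rather than on the final text alone.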
Agents unlock powerful workflows but multiply testing difficulty:
- Multi-step execution. A task spans many LLM calls and tool calls; a single wrong step derails everything downstream, so tests must check intermediate states, not just the final answer.
- Tool errors and handling. Tools fail, time out or return malformed data. Test that the agent handles failures gracefully instead of hallucinating a result or looping forever.
- Compounding non-determinism. The chance of a wrong turn at each step compounds across steps, so end-to-end behavior varies far more than a single call does. Repeated-trial and variance testing matter even more here; fix the temperature and any sampling seed where the platform allows it.
- Safety and permissions. Because agents act on the world — send email, run code, spend money — testers must verify guardrails, authorization boundaries and safe handling of prompt injection that tries to trigger harmful tool use.
- Cost and loops. Planning loops can run away, inflating token cost and latency. Test for termination, step limits and budget ceilings.
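The compounding effect in the list above can be made concrete with a little arithmetic and a repeated-trial harness. The sketch assumes steps succeed independently with the same probability, which real agents will not satisfy exactly, and `flaky_agent_task` is a simulated task rather than a real agent.

```python
import random

def end_to_end_success(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps

def flaky_agent_task(rng, steps=10, per_step_success=0.95):
    """Simulated multi-step task: passes only if every step passes."""
    return all(rng.random() < per_step_success for _ in range(steps))

def run_trials(task_fn, trials, seed=0):
    """Run the task repeatedly with a seeded generator and return the pass rate."""
    rng = random.Random(seed)
    passes = sum(1 for _ in range(trials) if task_fn(rng))
    return passes / trials

# A step that succeeds 95% of the time looks reliable in isolation,
# but ten such steps in a row succeed only about 60% of the time.
expected = end_to_end_success(0.95, 10)

# Repeated trials estimate the same rate empirically; the seed keeps the run reproducible.
rate = run_trials(flaky_agent_task, trials=2000, seed=1)
```

Reporting a pass rate over many trials, with an agreed threshold, is usually more informative for agents than a single pass or fail result.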
Testing takeaways for both architectures
- Decompose the system and test each component in isolation before testing end to end.
- Prefer meaning-based and constraint-based oracles over exact string matching.
- Control and record the model version, temperature and prompts so failures stay reproducible.
- Treat retrieved context, tool outputs and memory as inputs that must themselves be validated.
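A constraint-based oracle, as recommended in the list above, checks properties of the output instead of comparing it to one expected string. The sketch below assumes the system is asked to return JSON with `answer` and `citations` keys; that output contract and the word limit are invented for illustration.

```python
import json

def check_answer(raw_output, retrieved_ids, max_words=50):
    """Return a list of constraint violations; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = [f"missing key: {key}" for key in ("answer", "citations") if key not in data]
    if problems:
        return problems

    if len(str(data["answer"]).split()) > max_words:
        problems.append("answer exceeds word limit")
    if not data["citations"]:
        problems.append("no citations given")
    # Citations must point at chunks that were actually retrieved.
    unknown = [c for c in data["citations"] if c not in retrieved_ids]
    if unknown:
        problems.append(f"citations not in retrieved context: {unknown}")
    return problems

retrieved_ids = {"c2", "c4"}
good = check_answer('{"answer": "Refunds take 14 days.", "citations": ["c2"]}', retrieved_ids)
bad = check_answer('{"answer": "Refunds take 14 days.", "citations": ["c9"]}', retrieved_ids)
```

Because the check tolerates any wording that satisfies the constraints, it stays stable across reruns where the model phrases the same answer differently.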
RAG and agents move much of the risk into the surrounding system — the data, the retrieval, the tools and the orchestration. That is precisely where a tester adds the most value, because those layers are observable, controllable and testable with adapted classical techniques such as boundary analysis, equivalence partitioning and negative testing applied to prompts, chunks and tool inputs.
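Negative testing of tool inputs can be sketched by injecting a failing tool and checking that the failure is surfaced instead of hidden. Everything here is hypothetical: `safe_tool_call` is one possible wrapper, `failing_search` is a test double, and `answer_from_observation` stands in for whatever policy the agent uses to turn an observation into a reply.

```python
def safe_tool_call(tool, *args, **kwargs):
    """Run a tool and convert any exception into a structured error observation."""
    try:
        return {"ok": True, "result": tool(*args, **kwargs)}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

def failing_search(query):
    """Test double: a search tool that always times out."""
    raise TimeoutError("search backend timed out")

def working_search(query):
    """Test double: a search tool that returns a fixed result."""
    return f"1 result for {query!r}"

def answer_from_observation(observation):
    """Stand-in agent policy: report the failure rather than invent a result."""
    if not observation["ok"]:
        return "I could not complete the lookup (" + observation["error"] + ")."
    return "Found: " + observation["result"]

failed = safe_tool_call(failing_search, "refund policy")
succeeded = safe_tool_call(working_search, "refund policy")

reply_on_failure = answer_from_observation(failed)
reply_on_success = answer_from_observation(succeeded)
```

The same pattern extends to malformed payloads and slow responses: swap in a test double for each failure mode and assert that the reply never contains a fabricated result.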
A RAG answer is fluent and on-topic but includes a claim that appears in none of the retrieved documents. Which validation target does this failure belong to?
Why does testing an agent tend to be harder than testing a single LLM call?
A tester wants to localize the cause of a RAG failure. Which distinction is most useful?