A RAG answer is fluent and on-topic but includes a claim that appears in none of the retrieved documents. Which validation target does this failure belong to?

Grounding and faithfulness. The context was retrieved, but the generator asserted something that appears nowhere in it, so this is a generation-stage faithfulness failure rather than a retrieval failure. A faithfulness check, the core RAG oracle, confirms that every claim traces back to a source.

Why does testing an agent tend to be harder than testing a single LLM call?

Agents chain many LLM and tool calls in a planning loop, so randomness at each step compounds across the whole task and a single wrong step can derail everything downstream. That compounding non-determinism, together with tool failures, forces testers to check intermediate states rather than only the final answer.

A tester wants to localize the cause of a RAG failure. Which distinction is most useful?

Whether the failure is a retrieval failure (the wrong context, or none, was fetched) or a generation failure (the correct context was fetched but the model ignored or contradicted it). Separating the two matters because the fixes are entirely different.


RAG and agents: architectures testers must validate

Two architectures dominate real GenAI products: retrieval-augmented generation (RAG), which grounds answers in external data, and agents, which let a model take actions. Each introduces distinct components and therefore distinct test targets, and each moves risk out of the model's weights and into the surrounding system.

RAG: grounding to reduce hallucination

A RAG system supplements the model's built-in knowledge with retrieved documents so answers are grounded in a trusted source rather than invented. The pipeline has three core parts:

Retriever — embeds the user query and searches a vector store for the most similar chunks, typically by cosine similarity.
Vector store / knowledge base — documents are split into chunks, embedded and indexed ahead of time.
Generator — the LLM receives the query plus the retrieved chunks as context and composes the final answer.
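As a rough illustration of the retriever step, the sketch below ranks chunks by cosine similarity against a query vector. The chunk ids and three-dimensional embeddings are made-up values for illustration; a real system would obtain embeddings from a model and search a vector database rather than a Python dict.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        store,
        key=lambda chunk_id: cosine_similarity(query_vec, store[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical pre-computed chunk embeddings (assumption, not real model output).
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "privacy-notice": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # hypothetical embedding of a refund question

top_chunks = retrieve(query_vec, store, k=2)
print(top_chunks)
```

The generator would then receive the text of `top_chunks` alongside the query; keeping the retriever callable on its own is what makes it testable in isolation.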

RAG reduces hallucination and lets a system cite current, domain-specific facts without retraining, but it introduces new, layered failure points that testers own.

Retrieval relevance. Did the retriever return chunks that actually contain the answer? Poor retrieval starves the generator, and no amount of prompt tuning fixes it. Test with known query-document pairs and measure the precision and recall of retrieval.
Grounding / faithfulness. Does the generated answer stay faithful to the retrieved context, or does it add unsupported claims? A faithfulness check verifies that every claim traces back to a source; this is the core RAG oracle.
Chunking. Chunk size and overlap decide whether a coherent answer even exists in a single retrievable unit. Too-large chunks dilute relevance; too-small chunks split facts apart. Chunk strategy is a genuine test variable.
Freshness and coverage. If the knowledge base is stale or incomplete, answers are wrong despite a correct pipeline — so test the currency of the data, not only the code.
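The retrieval-relevance check above can be sketched as a small precision and recall calculation over labelled query-document pairs. The queries, chunk ids and relevance labels are hypothetical test data, and set-based scoring is one reasonable choice among several (ranked metrics are another).

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical labelled cases: which chunks *should* come back for each query,
# and which chunks the retriever under test actually returned.
cases = [
    {
        "query": "How long do refunds take?",
        "relevant": ["c1", "c2"],
        "retrieved": ["c1", "c7", "c9", "c2"],
    },
    {
        "query": "Which countries do you ship to?",
        "relevant": ["c4"],
        "retrieved": ["c5", "c6"],
    },
]

scores = [precision_recall(c["retrieved"], c["relevant"]) for c in cases]
for case, (precision, recall) in zip(cases, scores):
    print(f"{case['query']}: precision={precision:.2f} recall={recall:.2f}")
```

The second case scores zero on both measures, which is the "starved generator" situation: no prompt change can recover an answer that was never retrieved.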

The RAG pipeline as a test map

Stage	Component	What testers validate
Ingest	Chunker and embedder	Chunk size and overlap; embedding quality
Store	Vector database	Index correctness; freshness and coverage
Retrieve	Retriever	Relevance (precision/recall) of returned chunks
Generate	LLM	Faithfulness and grounding; citation accuracy
Answer	Final output	Correctness, safety and required format
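The Ingest row treats chunk size and overlap as test variables. A minimal sketch of why, assuming a simple word-based chunker (real pipelines often split by tokens, sentences or document structure instead): the same fact can be retrievable under one setting and split apart under another.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def fact_is_retrievable(chunks, fact):
    """True if the fact survives intact inside at least one chunk."""
    return any(fact in chunk for chunk in chunks)


# Hypothetical knowledge-base sentence and the fact a test expects to find.
text = "Refunds are issued within 14 days of the return being received at the warehouse"
fact = "within 14 days"

no_overlap = chunk_words(text, size=4, overlap=0)
with_overlap = chunk_words(text, size=4, overlap=2)

print(fact_is_retrievable(no_overlap, fact))    # fact is split across two chunks
print(fact_is_retrievable(with_overlap, fact))  # overlap keeps the fact in one chunk
```

A test suite can sweep `size` and `overlap` over boundary values and assert that each known fact remains retrievable.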

A useful mental model: RAG failures are either retrieval failures (wrong or missing context) or generation failures (context was right but the model ignored or contradicted it). Localizing which one occurred is a core testing skill because the fixes differ completely. A common set of measurable RAG qualities is context relevance (are the retrieved chunks on-topic?), faithfulness (is the answer supported by them?) and answer relevance (does the answer address the question?). Because manual grading does not scale, teams often use an LLM-as-a-judge to score these, and the tester must in turn validate that the judge itself is calibrated and unbiased rather than trusting it blindly.
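The retrieval-versus-generation distinction can be turned into a simple triage rule. This is a sketch under stated assumptions: the test case knows which chunk ids contain the answer, and correctness of the final answer has already been judged by some other oracle. The function name and return labels are hypothetical.

```python
def localize_failure(expected_chunk_ids, retrieved_chunk_ids, answer_is_correct):
    """Classify a RAG test result as 'none', 'retrieval' or 'generation'."""
    if answer_is_correct:
        return "none"
    missing = set(expected_chunk_ids) - set(retrieved_chunk_ids)
    if missing:
        # The generator never saw the context it needed.
        return "retrieval"
    # The needed context was present, yet the answer is still wrong.
    return "generation"


# Hypothetical test outcomes.
print(localize_failure(["c1"], ["c1", "c3"], answer_is_correct=True))
print(localize_failure(["c1"], ["c2", "c3"], answer_is_correct=False))
print(localize_failure(["c1"], ["c1", "c3"], answer_is_correct=False))
```

Tagging each failed case this way tells the team whether to look at chunking, embeddings and the index, or at the prompt and the model.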

Agents: LLMs that plan and act

An agent extends an LLM with the ability to decide and act. Its typical components are:

Tools / function-calling — the model can invoke external functions or APIs such as search, calculators, databases or code execution, and then use the results.
Memory — short-term (the conversation) and long-term (persisted facts) state that informs later steps.
Planning loop — the agent reasons, chooses an action, observes the result and repeats until it reaches a goal, an iterative reason-act cycle.
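The three components above can be sketched as a small loop. The policy here is a scripted stub standing in for the LLM, and the action dictionary format, the `calculator` tool and the `max_steps` parameter are assumptions made for illustration, not any particular framework's API. The trace plays the role of short-term memory and gives tests access to intermediate states.

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Run a reason-act loop; return (answer, trace). Answer is None if the step limit is hit."""
    trace = []
    observation = None
    for step in range(1, max_steps + 1):
        action = policy(goal, observation, trace)  # "reason": choose the next action
        if action["type"] == "finish":
            trace.append({"step": step, "action": "finish", "result": action["answer"]})
            return action["answer"], trace
        tool = tools[action["tool"]]               # "act": call the chosen tool
        observation = tool(action["input"])        # "observe": keep the result
        trace.append({
            "step": step,
            "action": action["tool"],
            "input": action["input"],
            "result": observation,
        })
    return None, trace  # step limit reached without finishing


def scripted_policy(goal, observation, trace):
    """Stub for the LLM: call the calculator once, then finish."""
    if observation is None:
        return {"type": "tool", "tool": "calculator", "input": "6*7"}
    return {"type": "finish", "answer": f"The result is {observation}"}


def calculator(expression):
    """Toy tool that multiplies two integers written as 'a*b'."""
    left, right = expression.split("*")
    return int(left) * int(right)


answer, trace = run_agent(scripted_policy, {"calculator": calculator}, goal="What is 6 times 7?")
print(answer)
for entry in trace:
    print(entry)
```

Because the loop returns the trace, a test can assert on which tool was called with which input, not only on the final text.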

Agents unlock powerful workflows but multiply testing difficulty:

Multi-step execution. A task spans many LLM calls and tool calls; a single wrong step derails everything downstream, so tests must check intermediate states, not just the final answer.
Tool errors and handling. Tools fail, time out or return malformed data. Test that the agent handles failures gracefully instead of hallucinating a result or looping forever.
Compounding non-determinism. Randomness at each step multiplies across steps, so end-to-end behavior varies far more than a single call. Repeated-trial and variance testing matter even more here; pin sampling parameters such as temperature and seed where you can.
Safety and permissions. Because agents act on the world — send email, run code, spend money — testers must verify guardrails, authorization boundaries and safe handling of prompt injection that tries to trigger harmful tool use.
Cost and loops. Planning loops can run away, inflating token cost and latency. Test for termination, step limits and budget ceilings.
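To make the compounding effect concrete, the sketch below does two things: it computes the chance that every step of a chain succeeds, assuming steps fail independently (a simplification), and it estimates a pass rate over repeated trials. `flaky_agent` is a hypothetical stand-in for a real end-to-end agent run, and the 0.95 and 0.2 figures are illustrative, not measured.

```python
import random


def chain_success_probability(per_step_success, steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_success ** steps


def flaky_agent(rng, failure_rate=0.2):
    """Hypothetical agent run that fails at random with the given rate."""
    return "ok" if rng.random() >= failure_rate else "wrong"


def pass_rate(run_once, trials, seed):
    """Run the agent `trials` times with a seeded generator and return the share of passes."""
    rng = random.Random(seed)  # seeding keeps the test run reproducible
    passes = sum(1 for _ in range(trials) if run_once(rng) == "ok")
    return passes / trials


# A step that is right 95% of the time, chained 10 times.
print(round(chain_success_probability(0.95, 10), 3))

rate = pass_rate(flaky_agent, trials=200, seed=1234)
print(f"observed pass rate over 200 trials: {rate:.2f}")
```

A practical assertion is then a threshold on the pass rate (for example, at least some agreed percentage over N trials) rather than a single pass or fail verdict.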

Testing takeaways for both architectures

Decompose the system and test each component in isolation before testing end to end.
Prefer meaning-based and constraint-based oracles over exact string matching.
Control and record the model version, temperature and prompts so failures stay reproducible.
Treat retrieved context, tool outputs and memory as inputs that must themselves be validated.
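As an illustration of a constraint-based oracle, the sketch below checks properties of an output instead of comparing it with one expected string. The JSON shape, the `answer` and `sources` keys and the specific constraints are assumptions chosen for the example.

```python
import json


def check_answer(raw_output, required_keys, max_words, forbidden_terms=()):
    """Return a list of violated constraints; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    for key in required_keys:
        if key not in data:
            problems.append(f"missing key: {key}")

    answer = str(data.get("answer", ""))
    if len(answer.split()) > max_words:
        problems.append("answer too long")
    lowered = answer.lower()
    for term in forbidden_terms:
        if term.lower() in lowered:
            problems.append(f"forbidden term: {term}")
    return problems


# Two differently worded outputs (hypothetical) that should both pass.
output_a = '{"answer": "Refunds take up to 14 days.", "sources": ["c1"]}'
output_b = '{"answer": "You will be refunded within 14 days.", "sources": ["c1", "c2"]}'
output_bad = '{"answer": "Refunds are guaranteed instantly."}'

for output in (output_a, output_b, output_bad):
    print(check_answer(output, ["answer", "sources"], max_words=20, forbidden_terms=["guaranteed"]))
```

The same checker accepts any phrasing that satisfies the constraints, which is what keeps the test stable across non-deterministic runs.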

RAG and agents move much of the risk into the surrounding system — the data, the retrieval, the tools and the orchestration. That is precisely where a tester adds the most value, because those layers are observable, controllable and testable with adapted classical techniques such as boundary analysis, equivalence partitioning and negative testing applied to prompts, chunks and tool inputs.

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

1.3 RAG & agent architectures testers must understand

