A RAG chatbot must answer using only a supplied policy document. It adds a detail that is true in the real world but is not stated anywhere in the document. Which quality attribute has primarily failed?

Faithfulness. As the section states, faithfulness (groundedness) requires that every claim be supported by the provided context/source. Adding a detail absent from the document is a hallucination relative to source, even when the detail is factually true.

Which measurement most directly evaluates factuality (correctness) rather than faithfulness?

Comparing each claim against a trusted gold reference or knowledge base. The section defines factuality as truth against the real world, measured by comparing claims to trusted ground truth (gold reference or knowledge base). Entailment against retrieved passages measures faithfulness; grammar-error rate measures fluency; question-answer similarity measures relevance.

A generated response is grammatical and reads naturally, but its second paragraph contradicts a claim it made in the first. Which attribute is satisfied and which is violated?

Fluency satisfied; coherence violated. Per the section, fluency is surface language quality (grammatical and natural), which the answer has, while coherence requires logical consistency with no self-contradiction, which the answer lacks.

Faithfulness, factuality, relevance, coherence, and fluency

Why quality attributes replace the single oracle

Traditional software testing compares an actual result to one expected result. Generative AI breaks that assumption: for a single prompt there are many acceptable outputs and no unique correct string to diff against. CT-GenAI therefore replaces "pass/fail against an expected output" with a set of measurable quality attributes. A tester scores each response against these attributes using automated metrics, human raters, or an LLM-as-a-judge, and defines acceptance thresholds instead of exact matches. This section covers the five attributes that describe the content quality of a generated response: faithfulness, factuality, relevance, coherence, and fluency. Frameworks such as RAGAS and G-Eval bundle several of these into reusable scorers, but a tester must still understand what each attribute means before trusting a metric.

Faithfulness (groundedness)

Faithfulness — also called groundedness — measures whether every claim in the output is supported by the provided context or source material. It is the central attribute in retrieval-augmented generation (RAG), where the model is handed documents and must answer only from them. An output is faithful if a reader can point to the supplied context to justify each statement. A faithfulness failure is a hallucination relative to source: the model adds, contradicts, or extrapolates beyond what the context supports — even when the added claim happens to be true.

How to measure: decompose the answer into atomic claims and check each claim for entailment against the retrieved context. Useful signals are the proportion of claims entailed by the source, natural-language-inference (NLI) entailment scores, citation overlap between answer spans and source spans, and an LLM-as-a-judge that is explicitly instructed to use only the supplied passages.
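The claim-level procedure above can be sketched in a few lines. This is a minimal illustration, not a production scorer: `split_claims`, `token_overlap_entails`, and `faithfulness_score` are hypothetical names, sentence splitting stands in for real claim decomposition, and the token-overlap check is only a crude placeholder for an NLI model or a source-only LLM judge.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(answer: str) -> list[str]:
    """Naive claim decomposition: treat each sentence as one atomic claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def token_overlap_entails(context: str, claim: str, threshold: float = 0.8) -> bool:
    """Toy entailment stand-in (assumption): a claim counts as supported when
    at least `threshold` of its tokens also occur in the context."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return False
    shared = claim_tokens & tokens(context)
    return len(shared) / len(claim_tokens) >= threshold


def faithfulness_score(
    context: str,
    answer: str,
    entails: Callable[[str, str], bool] = token_overlap_entails,
) -> float:
    """Proportion of claims in the answer that the context supports."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if entails(context, claim))
    return supported / len(claims)


context = "Refunds are available within 30 days of purchase. A receipt is required."
answer = "Refunds are available within 30 days. Refunds are also paid in store credit."
score = faithfulness_score(context, answer)
print(score)  # 0.5: the second claim has no support in the context
```

The `entails` parameter is pluggable on purpose, so the same proportion can be computed with a stronger entailment check without changing the scoring logic.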

Factuality (correctness)

Factuality, or factual correctness, measures whether the output is true against the real world, independent of any supplied context. A model can be perfectly faithful yet factually wrong (for example, when the source document itself is outdated), or factually correct yet unfaithful (the answer is true but not supported by the given context). Testers must keep the two apart: faithfulness is checked against the source you gave the system, while factuality is checked against trusted ground truth or a reference answer.

How to measure: compare claims to a curated knowledge base, a gold reference answer, or fact-checking datasets; use exact-match or F1 against reference answers for closed questions; or route claims through a verification model or search-backed checker.
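For closed questions, exact match and token-level F1 against a gold answer can be computed as below. This is a sketch in the style of common question-answering evaluation scripts; the function names and the simple lowercase, alphanumeric-only normalisation are assumptions for illustration.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalisation."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "30 days"
print(exact_match("30 Days.", reference))        # True
print(exact_match("within 30 days", reference))  # False
print(round(token_f1("within 30 days", reference), 2))  # 0.8
```

Exact match is strict and suits short closed answers; token F1 gives partial credit when the prediction contains the reference plus extra words.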

Why faithfulness and factuality diverge (worked example)

Suppose a retrieved passage states "the tower is 300 metres tall" and the model answers "300 metres." The answer is faithful (it matches the source) but may be factually wrong if the passage is outdated. Conversely, if the model answers "330 metres" from its own training memory, it may be factually correct yet unfaithful. In a RAG system you usually optimise faithfulness first: a grounded but wrong answer points to a data problem (the source needs fixing) rather than a model problem, and only a grounded answer is traceable and auditable.
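The divergence in this example can be made concrete with two separate checks, one against the passage and one against a trusted reference. The sketch below assumes, purely for illustration, that the trusted reference says 330 metres; `numbers_in` and `classify` are hypothetical helpers that compare only the numbers in each text.

```python
import re


def numbers_in(text: str) -> set[str]:
    """All integer literals that appear in a text."""
    return set(re.findall(r"\d+", text))


def classify(answer: str, passage: str, trusted_reference: str) -> dict[str, bool]:
    """Check the answer's numbers against the passage (faithfulness)
    and against a trusted reference (factuality), independently."""
    answer_numbers = numbers_in(answer)
    has_numbers = bool(answer_numbers)
    return {
        "faithful": has_numbers and answer_numbers <= numbers_in(passage),
        "factual": has_numbers and answer_numbers <= numbers_in(trusted_reference),
    }


passage = "The tower is 300 metres tall."
trusted_reference = "330 metres"  # assumed ground truth, for illustration only

print(classify("300 metres", passage, trusted_reference))
# {'faithful': True, 'factual': False}
print(classify("330 metres", passage, trusted_reference))
# {'faithful': False, 'factual': True}
```

Keeping the two checks as separate fields mirrors the point of the example: the same answer can pass one and fail the other.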

Relevance, coherence, and fluency

Relevance measures whether the output actually answers the user's query and stays on topic — no ignored constraints, no off-topic padding. Measure it with answer-relevance metrics (embedding similarity between question and answer), rubric scoring for "did it address every part of the ask," and precision/recall over the required points.
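Precision and recall over required points can be sketched as simple set arithmetic. The example assumes the points an answer addresses have already been labelled upstream (by a human rater or an LLM judge); the point labels and the function name `point_precision_recall` are hypothetical.

```python
def point_precision_recall(
    required: set[str], addressed: set[str]
) -> tuple[float, float]:
    """Precision: share of addressed points that were actually required.
    Recall: share of required points that the answer addressed."""
    hits = required & addressed
    precision = len(hits) / len(addressed) if addressed else 0.0
    recall = len(hits) / len(required) if required else 0.0
    return precision, recall


# Points the query asked for, and points the answer was judged to cover.
required = {"refund window", "receipt needed", "refund method", "exclusions"}
addressed = {"refund window", "receipt needed", "store opening hours"}

precision, recall = point_precision_recall(required, addressed)
print(round(precision, 2), recall)  # 0.67 0.5
```

Low precision signals off-topic padding, while low recall signals ignored parts of the ask, so the two numbers map onto the two relevance failures named above.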

Coherence measures whether the response is logically consistent and well-structured: claims do not contradict each other and the argument flows. Measure it with human or LLM rubric scores, contradiction detection across sentences, and structure checks for ordering and referential consistency.
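Contradiction detection across sentences amounts to checking every sentence pair with some contradiction test. The sketch below uses a deliberately crude test (two sentences that differ only by the word "not") as a stand-in for an NLI model or LLM judge; all names are hypothetical.

```python
import re
from itertools import combinations
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def negation_contradicts(first: str, second: str) -> bool:
    """Toy contradiction test (assumption): the sentences are identical
    except that exactly one of them contains the word 'not'."""
    tokens_a = re.findall(r"[a-z0-9]+", first.lower())
    tokens_b = re.findall(r"[a-z0-9]+", second.lower())
    one_is_negated = ("not" in tokens_a) != ("not" in tokens_b)
    stripped_a = [t for t in tokens_a if t != "not"]
    stripped_b = [t for t in tokens_b if t != "not"]
    return one_is_negated and stripped_a == stripped_b


def find_contradictions(
    text: str,
    contradicts: Callable[[str, str], bool] = negation_contradicts,
) -> list[tuple[str, str]]:
    """Return every pair of sentences that the test flags as contradictory."""
    sentences = split_sentences(text)
    return [(a, b) for a, b in combinations(sentences, 2) if contradicts(a, b)]


response = (
    "Refunds are available online. A receipt is required. "
    "Refunds are not available online."
)
for pair in find_contradictions(response):
    print(pair)
# ('Refunds are available online.', 'Refunds are not available online.')
```

The pairwise loop is quadratic in the number of sentences, which is usually acceptable for a single response but worth keeping in mind for long outputs.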

Fluency measures surface language quality: grammatical, natural, readable prose. Measure it with grammar/error-rate checkers, the perplexity of the text under a reference language model, and readability scores. Fluency is necessary but weak on its own: modern models are almost always fluent, so a smooth answer can still be unfaithful, wrong, or irrelevant. This is why fluency should never be the sole acceptance criterion.
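Perplexity is the exponential of the average negative log-probability a reference model assigns to each token, so lower values mean the text looks more natural to that model. The sketch below substitutes a tiny add-one-smoothed bigram model, trained on two hypothetical sentences, for a real reference language model; it only illustrates the calculation.

```python
import math
from collections import Counter

START = "<s>"  # sentence-start marker


def train_bigram(sentences: list[str]) -> tuple[Counter, Counter, int]:
    """Count bigrams and their left-hand contexts over a small corpus."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    vocabulary: set[str] = set()
    for sentence in sentences:
        tokens = [START] + sentence.lower().split()
        vocabulary.update(tokens)
        for previous, word in zip(tokens, tokens[1:]):
            context_counts[previous] += 1
            bigram_counts[(previous, word)] += 1
    return context_counts, bigram_counts, len(vocabulary)


def perplexity(text: str, model: tuple[Counter, Counter, int]) -> float:
    """exp of the mean negative log-probability under the bigram model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = [START] + text.lower().split()
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("text must contain at least one token")
    log_prob = 0.0
    for previous, word in pairs:
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        numerator = bigram_counts[(previous, word)] + 1
        denominator = context_counts[previous] + vocab_size + 1
        log_prob += math.log(numerator / denominator)
    return math.exp(-log_prob / len(pairs))


model = train_bigram(["the cat sat on the mat", "the dog sat on the rug"])
natural = perplexity("the cat sat on the mat", model)
scrambled = perplexity("mat the on sat cat the", model)
print(natural < scrambled)  # True: the natural word order is less surprising
```

Perplexity values are only comparable under the same reference model and tokenisation, so a threshold set with one model does not transfer to another.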

Attribute reference table

Attribute	Definition	How to measure
Faithfulness / groundedness	Output supported by the provided context/source	Claim-level entailment vs source, NLI scores, citation overlap, source-only LLM judge
Factuality / correctness	Output true against real-world facts	Compare to gold reference/knowledge base, exact-match/F1, fact-checker model
Relevance	Answers the actual query, stays on topic	Question-answer similarity, rubric "addressed the ask," point recall
Coherence	Logically consistent, well-structured	Contradiction detection, structure checks, human/LLM rubric
Fluency	Grammatical, natural language	Grammar-error rate, reference-model perplexity, readability

Combining attributes into acceptance criteria

No single attribute defines a good answer. A production test suite sets a threshold per attribute and often weights them by risk: for a RAG support bot, faithfulness and relevance dominate; for a creative assistant, fluency and coherence matter more. Testers report the attributes separately rather than collapsing them into one score, because a blended average hides the exact failure — a fluent, coherent, irrelevant answer and an unfaithful one both look mediocre in aggregate but need very different fixes.
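The argument for per-attribute reporting can be shown with two score sets that share the same average but fail different thresholds. The threshold values and scores below are hypothetical, chosen only to illustrate a RAG support bot profile.

```python
from statistics import mean

# Hypothetical minimum scores for a RAG support bot; real values depend on risk.
THRESHOLDS = {
    "faithfulness": 0.90,
    "relevance": 0.80,
    "coherence": 0.70,
    "fluency": 0.70,
}


def failed_attributes(
    scores: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Names of the attributes whose score falls below its own threshold."""
    missing = sorted(thresholds.keys() - scores.keys())
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sorted(
        name for name, minimum in thresholds.items() if scores[name] < minimum
    )


off_topic = {"faithfulness": 0.95, "relevance": 0.40, "coherence": 0.90, "fluency": 0.95}
ungrounded = {"faithfulness": 0.40, "relevance": 0.95, "coherence": 0.90, "fluency": 0.95}

for name, scores in (("off_topic", off_topic), ("ungrounded", ungrounded)):
    average = round(mean(list(scores.values())), 2)
    print(name, average, failed_attributes(scores, THRESHOLDS))
# off_topic 0.8 ['relevance']
# ungrounded 0.8 ['faithfulness']
```

Both responses average 0.8, yet one needs a relevance fix and the other a grounding fix, which is exactly the information a blended score discards.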

Exam tip: if a question describes an answer that invents details not in the provided document, the attribute at fault is faithfulness, not factuality — even when the invented detail is objectively true.

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

2.1 Faithfulness, factuality, relevance, coherence & fluency

