2.1 Faithfulness, factuality, relevance, coherence & fluency
Key Takeaways
- Faithfulness (groundedness) means the output is supported by the provided context/source; it is the central attribute in RAG.
- Factuality (correctness) means the output is true against the real world; a response can be faithful yet factually wrong, or factually correct yet unfaithful.
- Relevance measures whether the answer actually addresses the query; coherence measures logical consistency and structure.
- Fluency measures whether the language is grammatical and natural; it is necessary but a weak signal on its own, since modern models are almost always fluent.
- Each attribute has measurable signals: claim-level entailment for faithfulness, gold-reference comparison for factuality, and rubric or embedding scores for relevance/coherence.
Why quality attributes replace the single oracle
Traditional software testing compares an actual result to one expected result. Generative AI breaks that assumption: for a single prompt there are many acceptable outputs and no unique correct string to diff against. CT-GenAI therefore replaces "pass/fail against an expected output" with a set of measurable quality attributes. A tester scores each response against these attributes using automated metrics, human raters, or an LLM-as-a-judge, and defines acceptance thresholds instead of exact matches. This section covers the five attributes that describe the content quality of a generated response: faithfulness, factuality, relevance, coherence, and fluency. Frameworks such as RAGAS and G-Eval bundle several of these into reusable scorers, but a tester must still understand what each attribute means before trusting a metric.
Faithfulness (groundedness)
Faithfulness — also called groundedness — measures whether every claim in the output is supported by the provided context or source material. It is the central attribute in retrieval-augmented generation (RAG), where the model is handed documents and must answer only from them. An output is faithful if a reader can point to the supplied context to justify each statement. A faithfulness failure is a hallucination relative to source: the model adds, contradicts, or extrapolates beyond what the context supports — even when the added claim happens to be true.
How to measure: decompose the answer into atomic claims and check each claim for entailment against the retrieved context. Useful signals are the proportion of claims entailed by the source, natural-language-inference (NLI) entailment scores, citation overlap between answer spans and source spans, and an LLM-as-a-judge that is explicitly instructed to use only the supplied passages.
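The claim-level procedure can be sketched in a few lines. This is a minimal illustration, not a production scorer: the sentence splitter and the `toy_entails` word-overlap check are hypothetical stand-ins for a real claim extractor and an NLI model or source-only LLM judge, and the policy text is invented for the example.

```python
import re
from typing import Callable, List


def words(text: str) -> List[str]:
    """Lowercase alphanumeric tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_claims(answer: str) -> List[str]:
    """Naive claim splitter (assumption): treat each sentence as one atomic claim."""
    return [part.strip() for part in re.split(r"[.!?]+", answer) if part.strip()]


def toy_entails(context: str, claim: str) -> bool:
    """Hypothetical stand-in for an NLI model: a claim counts as supported
    only if every one of its words appears somewhere in the context."""
    context_words = set(words(context))
    return all(word in context_words for word in words(claim))


def faithfulness_score(
    answer: str,
    context: str,
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Proportion of claims in the answer that the context entails."""
    claims = split_into_claims(answer)
    if not claims:
        return 0.0  # nothing to verify; this sketch scores an empty answer as 0
    supported = sum(1 for claim in claims if entails(context, claim))
    return supported / len(claims)


context = "The refund window is 30 days. Refunds are issued to the original payment method."
answer = "The refund window is 30 days. Refunds take 5 business days."
score = faithfulness_score(answer, context)
print(score)  # 0.5: the second claim is not supported by the context
```

In practice the `entails` argument is where a real entailment model would be plugged in; the ratio itself stays the same.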
Factuality (correctness)
Factuality — or factual correctness — measures whether the output is true against the real world, independent of any supplied context. A model can be perfectly faithful yet factually wrong (the source document itself was outdated), or factually correct yet unfaithful (the answer is true but not supported by the given context). Testers must keep the two apart: faithfulness is checked against the source you gave the system, while factuality is checked against trusted ground truth or a reference answer.
How to measure: compare claims to a curated knowledge base, a gold reference answer, or fact-checking datasets; use exact-match or F1 against reference answers for closed questions; or route claims through a verification model or search-backed checker.
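For closed questions, exact match and token-level F1 against a gold reference can be computed as below. This is a sketch in the style commonly used for extractive question answering; the normalisation rule (lowercase, alphanumeric tokens only) and the gold answer are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric tokens only (assumed normalisation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalised token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "330 metres"  # hypothetical gold reference answer
print(exact_match("330 Metres.", gold))
print(round(token_f1("about 330 metres", gold), 2))  # 0.8
print(token_f1("300 metres", gold))
```

Exact match is strict and suits short closed answers; token F1 gives partial credit, which also means a wrong number can still score above zero, so thresholds need care.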
Why faithfulness and factuality diverge (worked example)
Suppose a retrieved passage states "the tower is 300 metres tall" and the model answers "300 metres." The answer is faithful (it matches the source) but may be factually wrong if the passage is outdated. Conversely, if the model answers "330 metres" from its own training memory, it may be factually correct yet unfaithful. In a RAG system you usually optimise faithfulness first, because a grounded but wrong answer points to a data problem in the source rather than a model problem, and only a grounded answer is traceable and auditable.
Relevance, coherence, and fluency
Relevance measures whether the output actually answers the user's query and stays on topic — no ignored constraints, no off-topic padding. Measure it with answer-relevance metrics (embedding similarity between question and answer), rubric scoring for "did it address every part of the ask," and precision/recall over the required points.
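Precision and recall over the required points can be sketched as follows. The example assumes a human or LLM rater has already labelled which points the answer makes; the point labels and the password-reset scenario are hypothetical.

```python
from typing import Iterable, Tuple


def point_precision_recall(
    required: Iterable[str], found: Iterable[str]
) -> Tuple[float, float]:
    """Precision: share of points made in the answer that were actually required.
    Recall: share of required points that the answer covered."""
    required_set, found_set = set(required), set(found)
    hits = required_set & found_set
    precision = len(hits) / len(found_set) if found_set else 0.0
    recall = len(hits) / len(required_set) if required_set else 0.0
    return precision, recall


# Points a complete answer must cover (assumed to come from the test case).
required = {"reset steps", "link expiry", "support contact"}
# Points a rater identified in the generated answer.
found = {"reset steps", "link expiry", "company history"}

precision, recall = point_precision_recall(required, found)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Low recall signals an ignored part of the ask, while low precision signals off-topic padding, so the two numbers map onto the two relevance failures named above.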
Coherence measures whether the response is logically consistent and well-structured: claims do not contradict each other and the argument flows. Measure it with human or LLM rubric scores, contradiction detection across sentences, and structure checks for ordering and referential consistency.
Fluency measures surface language quality: grammatical, natural, readable prose. Measure it with grammar/error-rate checkers, the perplexity of the text under a reference language model, and readability scores. Fluency is necessary but weak on its own: modern models are almost always fluent, so a smooth answer can still be unfaithful, wrong, or irrelevant. This is why fluency should never be the sole acceptance criterion.
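Perplexity is the exponential of the average negative log-probability per token. The sketch below assumes the reference language model exposes natural-log probabilities for each token of the response; the probability values are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability; lower means the reference
    model finds the text more predictable, a rough proxy for fluency."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_negative_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_logprob)


# Illustrative values: each token assigned probability 0.25 vs 0.05.
smooth_text = [math.log(0.25)] * 4
awkward_text = [math.log(0.05)] * 4

print(round(perplexity(smooth_text), 2))   # 4.0
print(round(perplexity(awkward_text), 2))  # 20.0
```

Perplexity values are only comparable when they come from the same reference model and tokenizer, so a threshold chosen for one model does not transfer to another.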
Attribute reference table
| Attribute | Definition | How to measure |
|---|---|---|
| Faithfulness / groundedness | Output supported by the provided context/source | Claim-level entailment vs source, NLI scores, citation overlap, source-only LLM judge |
| Factuality / correctness | Output true against real-world facts | Compare to gold reference/knowledge base, exact-match/F1, fact-checker model |
| Relevance | Answers the actual query, stays on topic | Question-answer similarity, rubric "addressed the ask," point recall |
| Coherence | Logically consistent, well-structured | Contradiction detection, structure checks, human/LLM rubric |
| Fluency | Grammatical, natural language | Grammar-error rate, reference-model perplexity, readability |
Combining attributes into acceptance criteria
No single attribute defines a good answer. A production test suite sets a threshold per attribute and often weights them by risk: for a RAG support bot, faithfulness and relevance dominate; for a creative assistant, fluency and coherence matter more. Testers report the attributes separately rather than collapsing them into one score, because a blended average hides the exact failure — a fluent, coherent, irrelevant answer and an unfaithful one both look mediocre in aggregate but need very different fixes.
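A small sketch of per-attribute thresholds shows why separate reporting matters. The threshold values and the two score sets are hypothetical; real thresholds would be set from the product's risk profile.

```python
from typing import Dict, List

# Hypothetical acceptance thresholds for a RAG support bot.
THRESHOLDS: Dict[str, float] = {
    "faithfulness": 0.8,
    "relevance": 0.7,
    "coherence": 0.6,
    "fluency": 0.6,
}


def failed_attributes(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Attributes whose score falls below the required minimum."""
    return [name for name, minimum in thresholds.items() if scores[name] < minimum]


def blended_average(scores: Dict[str, float]) -> float:
    """Single collapsed score, shown only to illustrate what it hides."""
    return sum(scores.values()) / len(scores)


irrelevant_answer = {"faithfulness": 0.9, "relevance": 0.2, "coherence": 0.7, "fluency": 0.8}
unfaithful_answer = {"faithfulness": 0.2, "relevance": 0.9, "coherence": 0.7, "fluency": 0.8}

for label, scores in [("irrelevant", irrelevant_answer), ("unfaithful", unfaithful_answer)]:
    print(label, round(blended_average(scores), 2), failed_attributes(scores, THRESHOLDS))
```

Both answers collapse to the same average, yet the per-attribute report names a different failing attribute for each, which is the information a tester needs to choose the right fix.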
Exam tip: if a question describes an answer that invents details not in the provided document, the attribute at fault is faithfulness, not factuality — even when the invented detail is objectively true.
A RAG chatbot must answer using only a supplied policy document. It adds a detail that is true in the real world but is not stated anywhere in the document. Which quality attribute has primarily failed?
Which measurement most directly evaluates factuality (correctness) rather than faithfulness?
A generated response is grammatical and reads naturally, but its second paragraph contradicts a claim it made in the first. Which attribute is satisfied and which is violated?