A team scores an open-ended chatbot's answers with BLEU against a single reference answer and is surprised that clearly correct, well-phrased responses receive low scores. What is the best explanation?

BLEU rewards n-gram overlap with the reference, so correct answers worded differently score poorly. The section explains that n-gram metrics like BLEU reward surface overlap, not meaning, so a correct answer phrased differently from the reference scores poorly — a known weakness for open-ended generation. BLEU is not embedding-based and is not an LLM judge.

Which limitation is specific to using LLM-as-a-judge for scoring outputs?

It is prone to position, verbosity, and self-preference bias and is non-deterministic. The section lists position bias, verbosity bias, self-preference, prompt sensitivity, and non-determinism as LLM-as-judge limitations. Being reference-free is actually a strength; counting n-grams describes BLEU/ROUGE, and determinism is the opposite of the judge's behaviour.

A tester needs to check that model answers still mean the same thing as a golden reference even when the wording changes, without paying for an LLM judge on every run. Which metric is the most appropriate choice?

BERTScore or cosine similarity over embeddings, because they capture semantic similarity across paraphrases. The section recommends semantic similarity (cosine/BERTScore) when meaning must match but wording can vary, or to detect drift against a golden answer. ROUGE is n-gram overlap, TruthfulQA is a benchmark, and LLM-as-judge is the more expensive, non-deterministic option here.

Automated metrics — Free Study Guide 2026

From human review to automated scoring

Manual review of generative outputs is accurate but slow and expensive, so testers rely on automated metrics to score outputs at scale. No single metric is authoritative; each captures a different notion of "good," and each has failure modes. The CT-GenAI perspective is to know which metric family fits which task, and to combine them rather than trust one number. Automated metrics fall into three broad families: reference-based n-gram overlap, embedding or semantic similarity, and model-based judging.

A useful cross-cutting axis is whether a metric is reference-based (it needs a human-written gold answer to compare against) or reference-free (it scores the output directly). N-gram and embedding metrics are reference-based; LLM-as-judge and several RAG metrics can operate reference-free. Reference-based metrics are cheaper to compute and more objective, but they depend on labelled data that is costly to produce, and they penalise valid variation; reference-free metrics scale to open-ended tasks but need careful calibration.

Reference-based n-gram overlap

These metrics compare a candidate output against one or more human reference texts by counting overlapping words or n-grams.

BLEU measures precision of overlapping n-grams; it was designed for machine translation and rewards outputs that reuse reference wording.
ROUGE emphasises recall of overlapping n-grams and longest common subsequences; it is common for summarisation.
METEOR aligns candidate and reference tokens, credits stem and synonym matches as well as exact ones, and penalises fragmented word order, partially addressing BLEU's rigidity.

Concretely, BLEU combines modified n-gram precision (typically up to 4-grams) with a brevity penalty that discourages overly short outputs, while ROUGE is usually reported as ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence). BLEU, ROUGE, and METEOR all assume the reference text is a reasonable gold standard, an assumption that breaks down when there are many equally valid answers.
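The two BLEU ingredients named above can be sketched in a few lines of Python. This is a simplified, single-reference, sentence-level illustration with no smoothing, not a replacement for an established BLEU implementation; the function names and the example sentences are made up for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, reference):
    """1.0 when the candidate is at least as long as the reference,
    otherwise exp(1 - r/c)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped precisions for n = 1..max_n, times the
    brevity penalty. Unsmoothed: any zero precision gives a score of 0."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), reference))         # 1.0
print(simple_bleu("a cat was sitting on the mat".split(), reference))   # 0.0
print(modified_precision("the the the the".split(), reference, 1))      # 0.5
```

The paraphrase shares no 4-gram with the reference, so the unsmoothed score collapses to zero even though the meaning is preserved. The last line shows why precision is "modified": repeating a reference word four times earns credit for only the two occurrences the reference actually contains.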

Their appeal is that they are fast, cheap, deterministic, and reproducible. Their limitation is that they reward surface overlap, not meaning: a correct answer phrased differently from the reference scores poorly, while a fluent-but-wrong answer that echoes reference words can score well. For open-ended generation — creative writing, chat, reasoning — where many valid outputs exist and no single reference is complete, n-gram metrics correlate weakly with human judgement.
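Both failure modes described above are easy to reproduce. The sketch below computes recall-only versions of ROUGE-1 and ROUGE-L (practical ROUGE reports usually include precision and F-measure as well); the function names and the three sentences are invented for illustration.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate
    (counts clipped so repeated words are not over-credited)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / len(reference) if reference else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "the drug is safe for children under five".split()
wrong = "the drug is not safe for children under five".split()
right = "kids younger than five can take this medicine safely".split()

print(rouge1_recall(wrong, reference), rouge_l_recall(wrong, reference))   # 1.0 1.0
print(rouge1_recall(right, reference), rouge_l_recall(right, reference))   # 0.125 0.125
```

The answer that inserts "not" contradicts the reference yet receives perfect recall, while the correct paraphrase shares only the word "five" and scores 0.125.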

Embedding and semantic similarity

To capture meaning rather than exact wording, semantic metrics compare vector representations of texts.

Cosine similarity over sentence or document embeddings measures how close two texts sit in vector space, so paraphrases with different words can still score highly.
BERTScore matches contextual token embeddings between candidate and reference and reports precision, recall, and F1, giving credit for semantically equivalent tokens.

These handle paraphrase far better than n-gram overlap and are useful when the meaning must match a reference. Their limits: scores depend on the embedding model chosen, they can rate topically-similar-but-factually-wrong text as similar, and they still require a reference. High similarity does not prove factual correctness.
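A minimal sketch of the two calculations follows. The three-dimensional vectors are hand-picked toy values, not the output of any real embedding model, and the greedy matching shown is only the core idea behind BERTScore: a real implementation uses contextual embeddings from a transformer and may add refinements such as token weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """BERTScore-style greedy matching: each token is paired with its most
    similar token on the other side. Returns (precision, recall, f1)."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical 3-d "embeddings", chosen by hand for illustration only.
TOY = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "fast": [0.1, 0.9, 0.0],
    "quick": [0.15, 0.85, 0.1],
    "banana": [0.0, 0.1, 0.95],
}

reference = [TOY[w] for w in ["fast", "car"]]
paraphrase = [TOY[w] for w in ["quick", "automobile"]]
unrelated = [TOY[w] for w in ["banana"]]

print(greedy_match_scores(paraphrase, reference))  # all three values close to 1
print(greedy_match_scores(unrelated, reference))   # all three values close to 0
```

With these toy vectors the paraphrase scores highly despite sharing no words with the reference, which an n-gram metric would miss. The same mechanism explains the caveat above: a factually wrong sentence built from nearby vectors would also score highly.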

LLM-as-a-judge

A strong model can be prompted to score another model's output, either against a rubric or by comparing two candidates.

Pairwise comparison asks the judge which of two responses is better — robust for ranking competing systems.
Rubric or direct scoring asks the judge to rate an output on defined criteria (helpfulness, correctness, safety), often on a numeric scale.

LLM-as-judge can run reference-free, is flexible, and correlates better with human preference on open-ended tasks than n-gram metrics. But it carries real biases the tester must control: position bias (favouring the first or second option), verbosity bias (preferring longer answers), self-preference (favouring outputs from the same model family), and sensitivity to prompt wording. Mitigations include swapping option order, fixing clear rubrics, using few-shot calibration, and validating the judge against human labels. It also costs more per evaluation than n-gram or embedding metrics and is itself non-deterministic.
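The order-swapping mitigation can be sketched as follows. The `Judge` signature and both stub judges are hypothetical stand-ins; in practice the judge would wrap a call to whichever model the team uses, which is deliberately left out here.

```python
from typing import Callable

# Hypothetical judge signature (an assumption of this sketch):
# (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge twice with the answer order swapped.
    Returns "A" or "B" when both orderings agree, otherwise "tie"."""
    verdict_ab = judge(question, a, b)  # A shown first
    verdict_ba = judge(question, b, a)  # B shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always picks the first option."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge with verbosity bias: picks the longer answer."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_pairwise(always_first, "q", "short", "a much longer answer"))    # tie
print(debiased_pairwise(prefers_longer, "q", "short", "a much longer answer"))  # B
```

Swapping exposes the position-biased judge, because its two verdicts contradict each other. It does not expose the verbosity-biased judge, which consistently picks the longer answer in both orderings, so rubrics and validation against human labels are still needed.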

Metric family	What it measures	Key limitation
BLEU / ROUGE / METEOR	N-gram overlap with a reference text	Rewards surface wording, misses meaning; weak for open-ended tasks
Cosine / BERTScore	Semantic similarity via embeddings	Needs a reference; similarity is not correctness; model-dependent
LLM-as-judge	Model-scored quality vs rubric or pairwise	Position/verbosity/self-preference bias; costly; non-deterministic

Choosing the right metric

Use n-gram metrics when there is a close, canonical reference (translation, tightly-specified summaries) and you need cheap regression signals.
Use semantic similarity when meaning must match but wording can vary, or to detect drift against a golden answer.
Use LLM-as-judge for open-ended quality, subjective criteria, or when references are unavailable — but calibrate and audit it.

The practical guidance is triangulation: combine a fast deterministic metric for regressions, a semantic metric for meaning, and judged or human review for the dimensions that matter most. Every metric is a proxy, so testers report score distributions and track them over time rather than treating a single value as ground truth.

Statistical rigour

Whatever the metric, a single score on a handful of examples is noise. Testers should evaluate on a sufficiently large, representative sample, report the score distribution or a confidence interval, and compare candidate systems on the same fixed dataset. Small differences between two systems may not be meaningful, so tracking a metric's trend across builds — and confirming that a change is larger than run-to-run variance — is more informative than any single absolute value.
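One way to check whether a difference between two builds exceeds sampling noise is a paired bootstrap over the shared examples. The sketch below is one reasonable approach rather than a prescribed method; the per-example scores are made up, and the function name, resample count, and seed are arbitrary choices for this illustration.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(B) - mean(A),
    resampling examples (not systems) so the pairing is preserved."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Made-up per-example scores for two builds on the same 10 prompts.
build_a = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73, 0.59, 0.77, 0.68, 0.64]
build_b = [0.65, 0.69, 0.60, 0.80, 0.70, 0.75, 0.58, 0.79, 0.71, 0.66]

mean_diff, lower, upper = paired_bootstrap_ci(build_a, build_b)
print(f"mean difference {mean_diff:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
if lower <= 0 <= upper:
    print("interval includes 0: the improvement may be noise")
```

Fixing the seed makes the interval reproducible across runs. For a non-deterministic metric such as an LLM judge, the per-example scores would themselves need to be averaged over repeated runs before this comparison is meaningful.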

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

5.2 Automated metrics (BLEU/ROUGE/METEOR, embedding similarity, LLM-as-judge)
