5.2 Automated metrics (BLEU/ROUGE/METEOR, embedding similarity, LLM-as-judge)

Key Takeaways

  • Automated metrics fall into three families: reference-based n-gram overlap, embedding/semantic similarity, and LLM-as-a-judge.
  • BLEU (precision), ROUGE (recall), and METEOR (alignment with synonyms) are fast and deterministic but reward surface wording over meaning.
  • Embedding metrics like cosine similarity and BERTScore capture paraphrase but need a reference and do not prove factual correctness.
  • LLM-as-judge can run reference-free and is flexible for open-ended tasks, but it suffers from position, verbosity, and self-preference biases that must be mitigated.
  • Best practice is triangulation: combine a cheap deterministic metric, a semantic metric, and judged/human review, reporting distributions over time.
Last updated: July 2026

From human review to automated scoring

Manual review of generative outputs is accurate but slow and expensive, so testers rely on automated metrics to score outputs at scale. No single metric is authoritative; each captures a different notion of "good," and each has failure modes. The CT-GenAI perspective is to know which metric family fits which task, and to combine them rather than trust one number. Automated metrics fall into three broad families: reference-based n-gram overlap, embedding or semantic similarity, and model-based judging.

A useful cross-cutting axis is whether a metric is reference-based (it needs a human-written gold answer to compare against) or reference-free (it scores the output directly). N-gram and embedding metrics are reference-based; LLM-as-judge and several RAG metrics can operate reference-free. Reference-based metrics are cheaper to run and more reproducible, but they depend on labelled references that are costly to create, and they penalise valid variation; reference-free metrics scale to open-ended tasks but need careful calibration.

Reference-based n-gram overlap

These metrics compare a candidate output against one or more human reference texts by counting overlapping words or n-grams.

  • BLEU measures precision of overlapping n-grams; it was designed for machine translation and rewards outputs that reuse reference wording.
  • ROUGE emphasises recall of overlapping n-grams and longest common subsequences; it is common for summarisation.
  • METEOR aligns candidate and reference tokens, credits exact, stem, and synonym matches, and penalises fragmented word order, partially addressing BLEU's rigidity.

Concretely, BLEU combines modified n-gram precision (typically up to 4-grams) with a brevity penalty that discourages overly short outputs, while ROUGE is usually reported as ROUGE-1, ROUGE-2, and ROUGE-L (longest common subsequence). All three assume the reference text is a reasonable gold standard, which breaks down when there are many equally valid answers.
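The BLEU description above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single reference, whitespace-tokenised input, and no smoothing; the function names (ngrams, modified_precision, simple_bleu) are made up for this example and are not taken from any library. Production work would normally use an established implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU sketch: geometric mean of the
    modified precisions for n = 1..max_n, times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), reference))
print(simple_bleu("the cat sat on".split(), reference))
```

The second call shows the brevity penalty at work: every n-gram in the shorter candidate appears in the reference, so all precisions are perfect, yet the score drops because the candidate is shorter than the reference.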

Their appeal is that they are fast, cheap, deterministic, and reproducible. Their limitation is that they reward surface overlap, not meaning: a correct answer phrased differently from the reference scores poorly, while a fluent-but-wrong answer that echoes reference words can score well. For open-ended generation — creative writing, chat, reasoning — where many valid outputs exist and no single reference is complete, n-gram metrics correlate weakly with human judgement.
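The surface-overlap problem described above is easy to reproduce. The sketch below, assuming whitespace tokenisation and hand-written example sentences, computes a unigram-overlap F1 (in the spirit of ROUGE-1) and an LCS-based recall (in the spirit of ROUGE-L). The function names are illustrative only and do not come from a library.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """Share of the reference covered by the longest common subsequence."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the drug lowers blood pressure".split()
paraphrase = "this medication reduces hypertension".split()  # correct, reworded
wrong_echo = "the drug raises blood pressure".split()        # wrong, echoes the reference

for name, cand in [("paraphrase", paraphrase), ("wrong_echo", wrong_echo)]:
    print(name, rouge1_f1(cand, reference), rouge_l_recall(cand, reference))
```

The correct paraphrase shares no tokens with the reference and scores zero on both measures, while the factually wrong answer differs by a single word and scores highly.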

Embedding and semantic similarity

To capture meaning rather than exact wording, semantic metrics compare vector representations of texts.

  • Cosine similarity over sentence or document embeddings measures how close two texts sit in vector space, so paraphrases with different words can still score highly.
  • BERTScore matches contextual token embeddings between candidate and reference and reports precision, recall, and F1, giving credit for semantically equivalent tokens.

These handle paraphrase far better than n-gram overlap and are useful when the meaning must match a reference. Their limits: scores depend on the embedding model chosen, they can rate topically-similar-but-factually-wrong text as similar, and they still require a reference. High similarity does not prove factual correctness.
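Both ideas can be illustrated without a real embedding model. The sketch below uses tiny hand-made vectors (TOY_VECTORS is invented for this example and does not come from any model) to show cosine similarity and a BERTScore-style greedy matching of token vectors. It omits the IDF weighting and baseline rescaling that the full BERTScore method can apply, so treat it as an approximation of the idea rather than the metric itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def greedy_match_f1(cand_vecs, ref_vecs):
    """Greedy token matching: each token is paired with its most similar
    token on the other side. Returns (precision, recall, f1)."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall <= 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hand-made toy vectors (an assumption for illustration). "raises" is
# deliberately placed close to "lowers", as related words often are.
TOY_VECTORS = {
    "drug": [1.0, 0.0, 0.0],
    "lowers": [0.0, 1.0, 0.0],
    "raises": [0.0, 0.9, 0.1],
    "pressure": [0.0, 0.0, 1.0],
}

reference = [TOY_VECTORS[t] for t in ["drug", "lowers", "pressure"]]
wrong = [TOY_VECTORS[t] for t in ["drug", "raises", "pressure"]]
print(greedy_match_f1(wrong, reference))
```

With these toy vectors the factually opposite sentence still scores close to 1, which is the "similarity is not correctness" limitation in miniature.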

LLM-as-a-judge

A strong model can be prompted to score another model's output, either against a rubric or by comparing two candidates.

  • Pairwise comparison asks the judge which of two responses is better — robust for ranking competing systems.
  • Rubric or direct scoring asks the judge to rate an output on defined criteria (helpfulness, correctness, safety), often on a numeric scale.

LLM-as-judge can run without a reference, is flexible, and typically correlates better with human preference on open-ended tasks than n-gram metrics do. But it carries real biases the tester must control: position bias (favouring whichever option appears first, or second), verbosity bias (preferring longer answers), self-preference (favouring outputs from the same model family), and sensitivity to prompt wording. Mitigations include swapping option order, fixing clear rubrics, using few-shot calibration, and validating the judge against human labels. A judge model also costs more per evaluation than the other metric families and is itself non-deterministic.
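The order-swapping mitigation can be sketched as follows. The judge here is a hypothetical callable that takes a prompt and two answers and returns "first" or "second"; in practice it would wrap a call to a judge model, which is not shown. The two stub judges exist only to make the example runnable.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

def debiased_pairwise(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answer order swapped and accept a
    winner only if both verdicts agree; otherwise report a tie."""
    verdict_ab = judge(prompt, answer_a, answer_b)
    verdict_ba = judge(prompt, answer_b, answer_a)
    for verdict in (verdict_ab, verdict_ba):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    a_wins_ab = verdict_ab == "first"   # A was shown first
    a_wins_ba = verdict_ba == "second"  # A was shown second
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"  # verdict flipped with position: likely position bias

# Stub judges for illustration only.
def always_first(prompt, first, second):
    return "first"  # a fully position-biased judge

def longer_wins(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"  # a verbosity-biased judge

print(debiased_pairwise(always_first, "q", "short", "a longer answer"))
print(debiased_pairwise(longer_wins, "q", "short", "a longer answer"))
```

Swapping neutralises the position-biased stub, which ends in a tie, but it does not catch the verbosity-biased stub, which consistently picks the longer answer. That is why the other mitigations, such as clear rubrics and validation against human labels, are still needed.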

Metric family | What it measures | Key limitation
BLEU / ROUGE / METEOR | N-gram overlap with a reference text | Rewards surface wording, misses meaning; weak for open-ended tasks
Cosine / BERTScore | Semantic similarity via embeddings | Needs a reference; similarity is not correctness; model-dependent
LLM-as-judge | Model-scored quality vs rubric or pairwise | Position/verbosity/self-preference bias; costly; non-deterministic

Choosing the right metric

  • Use n-gram metrics when there is a close, canonical reference (translation, tightly-specified summaries) and you need cheap regression signals.
  • Use semantic similarity when meaning must match but wording can vary, or to detect drift against a golden answer.
  • Use LLM-as-judge for open-ended quality, subjective criteria, or when references are unavailable — but calibrate and audit it.

The practical guidance is triangulation: combine a fast deterministic metric for regressions, a semantic metric for meaning, and judged or human review for the dimensions that matter most. Every metric is a proxy, so testers report score distributions and track them over time rather than treating a single value as ground truth.

Statistical rigour

Whatever the metric, a single score on a handful of examples is noise. Testers should evaluate on a sufficiently large, representative sample, report the score distribution or a confidence interval, and compare candidate systems on the same fixed dataset. Small differences between two systems may not be meaningful, so tracking a metric's trend across builds — and confirming that a change is larger than run-to-run variance — is more informative than any single absolute value.
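One common way to put a confidence interval on the difference between two systems is a paired bootstrap over per-example scores. The sketch below assumes both systems were scored on the same fixed dataset, in the same order; the function name and its defaults (2000 resamples, a 95% interval, a fixed seed) are illustrative choices, not a prescribed procedure.

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-example difference
    (system B minus system A). Returns (mean_diff, lower, upper)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-example differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper

# Made-up per-example scores for two builds on the same ten examples.
build_1 = [0.62, 0.70, 0.55, 0.81, 0.64, 0.73, 0.58, 0.69, 0.77, 0.60]
build_2 = [0.65, 0.68, 0.59, 0.83, 0.63, 0.75, 0.61, 0.70, 0.76, 0.66]
print(paired_bootstrap_ci(build_1, build_2))
```

If the interval includes zero, the observed improvement may not be distinguishable from sampling noise on this dataset. With a non-deterministic metric such as an LLM judge, run-to-run variance is an additional source of noise that this sketch does not model.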

Test Your Knowledge

A team scores an open-ended chatbot's answers with BLEU against a single reference answer and is surprised that clearly correct, well-phrased responses receive low scores. What is the best explanation?

Which limitation is specific to using LLM-as-a-judge for scoring outputs?

A tester needs to check that model answers still mean the same thing as a golden reference even when the wording changes, without paying for an LLM judge on every run. Which metric is the most appropriate choice?
