3.2 Perturbation & metamorphic testing

Key Takeaways

  • Metamorphic and perturbation testing sidestep the oracle problem by checking how outputs relate across related inputs, rather than comparing one output against a known answer.
  • A metamorphic relation links a source input/output to a transformed follow-up; the outputs must satisfy an expected relation (invariance or a directional change).
  • Common invariance relations: paraphrasing the prompt, adding irrelevant context, or reordering list items should not change a factual answer or decision.
  • Perturbation testing applies small meaning-preserving edits (typos, synonyms, casing) and asserts output stability, measured as a robustness rate.
  • Both techniques need no hand-labelled oracle and pair with the golden dataset; apply one transformation at a time and log every failing pair as a regression case.

Last updated: July 2026

The oracle problem and how to sidestep it

Section 3.1 assumed you already had a reference answer to compare against. Often you do not: nobody has hand-labelled the "correct" output for every possible input, and for open-ended generation a single ground truth may not even exist. This is the oracle problem. Metamorphic and perturbation testing sidestep it by checking the relationship between outputs across related inputs. You do not need to know the right answer; you only need to know how the answer should change, or stay the same, when you change the input in a defined way.

Metamorphic testing

A metamorphic relation (MR) is a rule that links a source input/output pair to a follow-up input/output pair. You transform the source input in a controlled way, run the model again, and assert that the two outputs satisfy the expected relation. If they do not, you have found a defect — without ever writing a golden answer.
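The definition above can be sketched as a small harness. This is a minimal illustration, not a prescribed implementation: `MetamorphicRelation`, `check_relation`, `PARAPHRASES`, and `toy_model` are hypothetical names introduced here, and `toy_model` is a stub standing in for the real system under test.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stub for the system under test; a real suite would call the model here.
def toy_model(prompt: str) -> str:
    return "Paris" if "france" in prompt.lower() else "unknown"

@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[str], str]     # source input -> follow-up input
    holds: Callable[[str, str], bool]   # (source output, follow-up output) -> relation satisfied?

def check_relation(model: Callable[[str], str], mr: MetamorphicRelation, source: str) -> bool:
    """Run the source and the follow-up through the model and test the output relation."""
    follow_up = mr.transform(source)
    return mr.holds(model(source), model(follow_up))

# Hand-written paraphrases (assumed for this sketch); no golden answer appears anywhere.
PARAPHRASES = {"What is the capital of France?": "Which city is France's capital?"}

paraphrase = MetamorphicRelation(
    name="paraphrase invariance",
    transform=lambda p: PARAPHRASES.get(p, p),
    # Exact match after normalisation is a simplification chosen for the sketch.
    holds=lambda a, b: a.strip().lower() == b.strip().lower(),
)

result = check_relation(toy_model, paraphrase, "What is the capital of France?")
print(paraphrase.name, "holds:", result)
```

Note that the harness never states what the correct answer is: it only compares the two outputs with each other, which is the point of the technique.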

Common relations for GenAI fall into two families:

  • Invariance relations — the transformation should not change the meaning of the output.
  • Directional / equivalence relations — the transformation should change the output in a predictable direction, or preserve a specific property.

Worked examples make this concrete:

  • Paraphrase invariance. Source: "What is the capital of France?" Follow-up: "Which city is France's capital?" The factual answer must stay semantically equal ("Paris"). If the paraphrase flips the answer, the model is unstable to surface wording.
  • Irrelevant-context invariance. Insert a neutral, unrelated sentence ("It rained yesterday.") before a factual question. A robust system's factual answer should not change; a changed answer reveals susceptibility to distraction.
  • Negation / antonym relation. Change "list the advantages of X" to "list the disadvantages of X"; the follow-up output should be substantively different, not a copy. A near-identical output signals that the model ignored the semantic flip.
  • Translation round-trip. Translate English to German to English; the meaning should be preserved, so large drift flags a translation defect.
  • Ordering invariance. For "summarise these three reviews," reordering the reviews should not materially change the summary's sentiment or key points.
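
Three of the worked examples above might be expressed as input transformations plus a check on the output pair. The sketch below is illustrative only: every function name is hypothetical, the stubs stand in for a real model, and the exact-match comparison is a crude placeholder for a semantic check.

```python
from typing import List

def add_irrelevant_context(prompt: str, filler: str = "It rained yesterday.") -> str:
    """Prepend a neutral, unrelated sentence to the prompt."""
    return f"{filler} {prompt}"

def flip_ask(prompt: str) -> str:
    """Turn an 'advantages' request into a 'disadvantages' request."""
    return prompt.replace("advantages", "disadvantages")

def reorder_items(items: List[str]) -> List[str]:
    """Reverse the list: a deterministic reordering that keeps the same items."""
    return list(reversed(items))

def same_answer(a: str, b: str) -> bool:
    """Crude invariance check; a real suite would compare meaning, not strings."""
    return a.strip().lower() == b.strip().lower()

# Hypothetical stubs standing in for the system under test.
def toy_model(prompt: str) -> str:
    text = prompt.lower()
    if "disadvantages" in text:          # checked first: it contains "advantages"
        return "Higher cost and added complexity."
    if "advantages" in text:
        return "Faster delivery and fewer defects."
    if "capital of france" in text:
        return "Paris"
    return "unknown"

def toy_summary_sentiment(reviews: List[str]) -> str:
    positives = sum("good" in review.lower() for review in reviews)
    return "positive" if positives * 2 > len(reviews) else "negative"

question = "What is the capital of France?"
context_ok = same_answer(toy_model(question), toy_model(add_irrelevant_context(question)))

ask = "List the advantages of X."
negation_ok = not same_answer(toy_model(ask), toy_model(flip_ask(ask)))  # must differ

reviews = ["Good battery.", "Screen is dim.", "Good value."]
ordering_ok = toy_summary_sentiment(reviews) == toy_summary_sentiment(reorder_items(reviews))

print("irrelevant-context invariance holds:", context_ok)
print("negation relation holds:", negation_ok)
print("ordering invariance holds:", ordering_ok)
```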

Test Your Knowledge

  • A tester asks 'What is the capital of France?' then 'Which city is France's capital?' and requires both answers to mean the same thing. Which technique is this?
  • Inserting the neutral sentence 'It rained yesterday.' before a factual question and requiring the factual answer to stay unchanged tests which property?
  • What does a perturbation test's 'robustness rate' measure?

The metamorphic-relations table below is the kind of artefact you build during test design:

Metamorphic relation          | Input transformation                       | Expected relation on output
------------------------------|--------------------------------------------|---------------------------------
Paraphrase invariance         | Reword the prompt, same intent             | Output meaning unchanged
Irrelevant-context invariance | Insert a neutral sentence                  | Factual answer unchanged
Negation                      | Flip the ask (advantages to disadvantages) | Output substantively different
Case / format invariance      | Change casing or spacing                   | Answer unchanged
Translation round-trip        | Language A to B to A                       | Meaning preserved
Ordering invariance           | Reorder list items                         | Summary or decision stays stable

Because outputs are non-deterministic, the assertion on the relation is itself tolerance-based: "semantically equal" means similarity above a threshold, not string identity. The semantic oracles from Section 3.1 can be reused for this comparison.
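
A tolerance-based assertion of this kind could look like the sketch below. It is an assumption-laden illustration: word-set (Jaccard) overlap is used only as a dependency-free stand-in for whichever semantic similarity measure the suite actually uses, and the 0.8 threshold is an arbitrary value picked for the example.

```python
import re

def token_set(text: str) -> set:
    """Lower-cased words and numbers, with punctuation ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets, in [0, 1]. A stand-in for a semantic similarity score."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def semantically_equal(a: str, b: str, threshold: float = 0.8) -> bool:
    """Pass when similarity reaches the threshold; string identity is not required."""
    return similarity(a, b) >= threshold

source_out = "The capital of France is Paris."
follow_out = "Paris is the capital of France."   # reworded, same meaning
drifted_out = "The capital of France is Lyon."   # one word differs, meaning changed

print(semantically_equal(source_out, follow_out))
print(semantically_equal(source_out, drifted_out))
```

Lexical overlap is a weak proxy: a looser rewording such as "France's capital is Paris" would score well below the threshold and trip the check even though the meaning is unchanged, which is one reason a real suite would plug in a proper semantic comparison here.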

Perturbation testing

Perturbation testing applies small, meaning-preserving changes to an input and checks that behaviour stays stable — it is the robustness lens on the same idea. Where a metamorphic relation defines an expected output relationship, a perturbation test typically asserts stability: a tiny edit should not cause a large output swing.

Typical perturbations include:

  • Typos and misspellings — "recieve", "teh" — the answer should be unaffected.
  • Synonym substitution — "big" to "large", "buy" to "purchase".
  • Whitespace / punctuation — extra spaces or missing commas.
  • Word or clause reordering where meaning is preserved.
  • Case changes — ALL CAPS or lower case.
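
The perturbations listed above are cheap to generate automatically. The following is a minimal sketch with hypothetical helper names; the synonym map is hand-picked for the example, and the fixed random seed is an assumption made so that runs are reproducible.

```python
import random
from typing import Dict

def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Introduce one typo by swapping two adjacent, different letters."""
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not positions:
        return text  # nothing to swap, return the input unchanged
    i = rng.choice(positions)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def substitute_synonyms(text: str, synonyms: Dict[str, str]) -> str:
    """Replace whole space-separated words using a small synonym map."""
    return " ".join(synonyms.get(word, word) for word in text.split(" "))

def add_extra_spaces(text: str) -> str:
    """Double every space to simulate sloppy whitespace."""
    return text.replace(" ", "  ")

def to_upper(text: str) -> str:
    """Rewrite the input in ALL CAPS."""
    return text.upper()

rng = random.Random(0)  # fixed seed so the same variants are produced on every run
source = "where can i buy a big tent"
variants = {
    "typo": swap_adjacent_chars(source, rng),
    "synonym": substitute_synonyms(source, {"big": "large", "buy": "purchase"}),
    "whitespace": add_extra_spaces(source),
    "case": to_upper(source),
}
for name, text in variants.items():
    print(f"{name:10} {text!r}")
```

Each variant applies exactly one kind of edit, so a later failure can be traced to a single perturbation class.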

You measure a robustness rate: the proportion of perturbed inputs whose output stays within tolerance of the original. A model that answers correctly on clean text but flips on a single typo has a robustness defect — important because real users type messily. Perturbation suites pair naturally with the golden dataset: take each golden input, generate a family of perturbations, and require every output to remain within the acceptable band.
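
As a rough sketch, the robustness rate is the number of perturbed inputs whose output stays within tolerance, divided by the number of perturbed inputs tried. The function name, the deliberately brittle stub model, and the exact-match tolerance check below are all assumptions made for illustration.

```python
from typing import Callable, Iterable

def robustness_rate(
    model: Callable[[str], str],
    source: str,
    perturbed_inputs: Iterable[str],
    within_tolerance: Callable[[str, str], bool],
) -> float:
    """Share of perturbed inputs whose output stays within tolerance of the original output."""
    baseline = model(source)
    variants = list(perturbed_inputs)
    if not variants:
        raise ValueError("need at least one perturbed input")
    stable = sum(1 for variant in variants if within_tolerance(baseline, model(variant)))
    return stable / len(variants)

# Hypothetical brittle stub: it only recognises one exact, case-sensitive phrase.
def brittle_model(prompt: str) -> str:
    return "Paris" if "capital of France" in prompt else "I don't know"

perturbed = [
    "what is the capital of France?",    # lower-cased first word
    "What is teh capital of France?",    # typo in "the"
    "What is the capital  of France?",   # extra space inside the key phrase
    "WHAT IS THE CAPITAL OF FRANCE?",    # all caps
]

rate = robustness_rate(
    brittle_model,
    "What is the capital of France?",
    perturbed,
    within_tolerance=lambda a, b: a == b,  # exact match, a simplification for the sketch
)
print(f"robustness rate: {rate:.2f}")
```

Here the stub survives the case change on the first word and the typo, but flips on the extra space and on all caps, so two of four variants stay stable.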

Designing the campaign

Good practice is to start from real or representative source inputs, apply one transformation at a time so a failure is diagnosable, automate generation so hundreds of variants are cheap, and log every failing pair as a new regression case. Both techniques scale precisely because they need no hand-labelled oracle — the relation, not the answer, is the specification. This makes them the workhorse of GenAI robustness testing and a direct complement to the golden-dataset regression baseline from the previous section.
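
One way such a campaign might be automated is sketched below: each follow-up applies a single named transformation, and every failing pair is kept as a record that could be appended to the regression suite. The record fields, function names, and stub model are hypothetical choices for this example.

```python
import json
from collections import Counter
from typing import Callable, Dict, List

def run_campaign(
    model: Callable[[str], str],
    sources: List[str],
    transformations: Dict[str, Callable[[str], str]],
    relation_holds: Callable[[str, str], bool],
) -> List[dict]:
    """Apply one transformation at a time and collect every failing source/follow-up pair."""
    failures = []
    for source in sources:
        source_out = model(source)
        for name, transform in transformations.items():
            follow_up = transform(source)
            follow_out = model(follow_up)
            if not relation_holds(source_out, follow_out):
                failures.append({
                    "transformation": name,
                    "source_input": source,
                    "follow_up_input": follow_up,
                    "source_output": source_out,
                    "follow_up_output": follow_out,
                })
    return failures

# Hypothetical stub: case-sensitive, so it breaks on upper-cased prompts.
def toy_model(prompt: str) -> str:
    if "France" in prompt:
        return "Paris"
    if "Japan" in prompt:
        return "Tokyo"
    return "unknown"

failures = run_campaign(
    toy_model,
    sources=["What is the capital of France?", "What is the capital of Japan?"],
    transformations={
        "upper_case": str.upper,
        "irrelevant_context": lambda p: "It rained yesterday. " + p,
    },
    relation_holds=lambda a, b: a == b,  # invariance; exact match is a simplification
)

for record in failures:
    print(json.dumps(record))  # one JSON line per new regression case

by_type = Counter(record["transformation"] for record in failures)
print(dict(by_type))
```

Because each record names the single transformation that produced it, counting failures per transformation shows at a glance which class of change the system handles worst.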

Choosing thresholds and avoiding false alarms

The hard part is calibrating the tolerance. Set it too tight and legitimate paraphrasing trips the test (a false positive); set it too loose and a genuine meaning change slips through (a false negative). Tune the similarity threshold empirically against a small labelled sample of known-good and known-bad output pairs, then hold it fixed so results stay comparable across runs. Watch for transformations that your relation marks as invariant but that should in fact change the answer: an "irrelevant" sentence that is actually relevant, or a synonym that shifts nuance. When a metamorphic or perturbation test fails, first confirm that the transformation really preserved the meaning of that particular input, and that the similarity check itself is not misfiring, before filing a defect; this keeps flaky oracles separate from real robustness bugs. Track failures by transformation type to see which perturbation class the model is weakest against.
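
Threshold tuning against a labelled sample can be sketched as a simple sweep. Everything below is illustrative: the similarity scores are made-up values rather than measurements, the candidate thresholds are arbitrary, and picking the lowest threshold on a tie is an assumption of this sketch.

```python
from typing import List, Tuple

# (similarity score, True if the output pair is known-good, i.e. same meaning)
LabelledPair = Tuple[float, bool]

def count_errors(sample: List[LabelledPair], threshold: float) -> Tuple[int, int]:
    """Return (false positives, false negatives) for a given threshold.

    False positive: a known-good pair scores below the threshold and trips the test.
    False negative: a known-bad pair scores at or above the threshold and slips through.
    """
    false_positives = sum(1 for score, good in sample if good and score < threshold)
    false_negatives = sum(1 for score, good in sample if not good and score >= threshold)
    return false_positives, false_negatives

def calibrate(sample: List[LabelledPair], candidates: List[float]) -> float:
    """Pick the candidate with the fewest total errors; ties go to the lower threshold."""
    return min(candidates, key=lambda t: (sum(count_errors(sample, t)), t))

# Made-up illustrative scores, not real measurements.
sample = [
    (0.92, True), (0.85, True), (0.78, True), (0.88, True),
    (0.40, False), (0.65, False), (0.81, False), (0.55, False),
]
candidates = [0.5, 0.6, 0.7, 0.8, 0.9]

best = calibrate(sample, candidates)
for threshold in candidates:
    print(threshold, count_errors(sample, threshold))
print("chosen threshold:", best)
```

In practice the two error types rarely cost the same, so a team might weight false negatives more heavily than false positives instead of minimising the plain total.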