3.2 Perturbation & metamorphic testing
Key Takeaways
- Metamorphic and perturbation testing sidestep the oracle problem by checking relationships between related inputs' outputs, not one output against a known answer.
- A metamorphic relation links a source input/output to a transformed follow-up; the outputs must satisfy an expected relation (invariance or a directional change).
- Common invariance relations: paraphrasing the prompt, adding irrelevant context, or reordering list items should not change a factual answer or decision.
- Perturbation testing applies small meaning-preserving edits (typos, synonyms, casing) and asserts output stability, measured as a robustness rate.
- Both techniques need no hand-labelled oracle and pair with the golden dataset; apply one transformation at a time and log every failing pair as a regression case.
The oracle problem and how to sidestep it
Section 3.1 assumed you already had a reference answer to compare against. Often you do not: nobody has hand-labelled the "correct" output for every possible input, and for open-ended generation a single ground truth may not even exist. This is the oracle problem. Metamorphic and perturbation testing sidestep it by checking the relationship between outputs across related inputs. You do not need to know the right answer; you only need to know how the answer should change, or stay the same, when you change the input in a defined way.
Metamorphic testing
A metamorphic relation (MR) is a rule that links a source input/output pair to a follow-up input/output pair. You transform the source input in a controlled way, run the model again, and assert that the two outputs satisfy the expected relation. If they do not, you have found a defect — without ever writing a golden answer.
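As a minimal sketch of that loop, a relation can be modelled as a pair of functions: one that transforms the input and one that judges the two outputs. Everything named here (`MetamorphicRelation`, `check_relation`, `toy_model`) is hypothetical and for illustration only; `toy_model` stands in for a real model call, and the exact-match comparison is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[str], str]    # source input -> follow-up input
    holds: Callable[[str, str], bool]  # (source output, follow-up output) -> verdict


def check_relation(model: Callable[[str], str], mr: MetamorphicRelation, source_input: str) -> dict:
    """Run the model on the source and follow-up inputs and judge the relation."""
    source_output = model(source_input)
    follow_up_input = mr.transform(source_input)
    follow_up_output = model(follow_up_input)
    return {
        "relation": mr.name,
        "source_input": source_input,
        "follow_up_input": follow_up_input,
        "source_output": source_output,
        "follow_up_output": follow_up_output,
        "passed": mr.holds(source_output, follow_up_output),
    }


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "Paris" if "france" in prompt.lower() else "unknown"


irrelevant_context = MetamorphicRelation(
    name="irrelevant-context invariance",
    transform=lambda p: "It rained yesterday. " + p,
    # Simplification: case-insensitive exact match instead of a semantic comparison.
    holds=lambda a, b: a.strip().lower() == b.strip().lower(),
)

result = check_relation(toy_model, irrelevant_context, "What is the capital of France?")
print(result["passed"])
```

Note that no reference answer appears anywhere in the check: the verdict depends only on how the two outputs relate to each other.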
Common relations for GenAI fall into two families:
- Invariance relations — the transformation should not change the meaning of the output.
- Directional / equivalence relations — the transformation should change the output in a predictable direction, or preserve a specific property.
Worked examples make this concrete:
- Paraphrase invariance. Source: "What is the capital of France?" Follow-up: "Which city is France's capital?" The factual answer must stay semantically equal ("Paris"). If the paraphrase flips the answer, the model is unstable to surface wording.
- Irrelevant-context invariance. Insert a neutral, unrelated sentence ("It rained yesterday.") before a factual question. A robust system's factual answer should not change; a changed answer reveals susceptibility to distraction.
- Negation / antonym relation. Change "list the advantages of X" to "list the disadvantages of X"; the follow-up output should be substantively different, not a copy. Sameness signals the model ignored the semantic flip.
- Translation round-trip. Translate a text from English to German and back to English; the meaning should be preserved, so large drift flags a translation defect.
- Ordering invariance. For "summarise these three reviews," reordering the reviews should not materially change the summary's sentiment or key points.
The metamorphic-relations table below is the kind of artefact you build during test design:
| Metamorphic relation | Input transformation | Expected relation on output |
|---|---|---|
| Paraphrase invariance | Reword the prompt, same intent | Output meaning unchanged |
| Irrelevant-context invariance | Insert a neutral sentence | Factual answer unchanged |
| Negation | Flip the ask (advantages to disadvantages) | Output substantively different |
| Case / format invariance | Change casing or spacing | Answer unchanged |
| Translation round-trip | Language A to B to A | Meaning preserved |
| Ordering invariance | Reorder list items | Summary or decision stays stable |
Because model outputs are non-deterministic, the assertion on the relation is itself tolerance-based: "semantically equal" means similarity above a threshold, not string identity. The semantic oracles from Section 3.1 can be reused for this comparison.
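A tolerance-based assertion might look like the sketch below. The token-overlap (Jaccard) score is an assumption made to keep the example dependency-free; it is a crude lexical stand-in for a real semantic oracle such as embedding similarity, and the function names and the 0.8 threshold are illustrative, not prescribed values.

```python
import re


def token_set(text: str) -> set:
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of token sets: a lexical proxy, NOT true semantic similarity."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def invariant(a: str, b: str, threshold: float = 0.8) -> bool:
    """Invariance relation: the two outputs must be at least `threshold` similar."""
    return similarity(a, b) >= threshold


def substantively_different(a: str, b: str, threshold: float = 0.8) -> bool:
    """Negation-style relation: the two outputs must fall below `threshold`."""
    return similarity(a, b) < threshold


# Same tokens in a different order: score 1.0, so the invariance relation holds.
print(invariant("The capital of France is Paris.", "Paris is the capital of France."))

# One token swapped: score 5/7, below 0.8, so the invariance relation fails.
print(invariant("The capital of France is Paris.", "The capital of France is Lyon."))
```

The second case also shows why a lexical proxy is risky: with a threshold of 0.7 the wrong answer "Lyon" would pass, which is the kind of miss a stronger semantic oracle is meant to prevent.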
Perturbation testing
Perturbation testing applies small, meaning-preserving changes to an input and checks that behaviour stays stable — it is the robustness lens on the same idea. Where a metamorphic relation defines an expected output relationship, a perturbation test typically asserts stability: a tiny edit should not cause a large output swing.
Typical perturbations include:
- Typos and misspellings — "recieve", "teh" — the answer should be unaffected.
- Synonym substitution — "big" to "large", "buy" to "purchase".
- Whitespace / punctuation — extra spaces or missing commas.
- Word or clause reordering where meaning is preserved.
- Case changes — ALL CAPS or lower case.
You measure a robustness rate: the proportion of perturbed inputs whose output stays within tolerance of the original. A model that answers correctly on clean text but flips on a single typo has a robustness defect — important because real users type messily. Perturbation suites pair naturally with the golden dataset: take each golden input, generate a family of perturbations, and require every output to remain within the acceptable band.
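The robustness rate can be computed with a few lines of code. This is a sketch under stated assumptions: the perturbation functions are simple deterministic examples, `toy_model` is a hypothetical and deliberately brittle stand-in for the system under test, and exact string equality stands in for a real tolerance check.

```python
def swap_typo(text: str) -> str:
    """Swap the first two letters of the first purely alphabetic word of 4+ letters."""
    words = text.split(" ")
    for i, word in enumerate(words):
        if len(word) >= 4 and word.isalpha():
            words[i] = word[1] + word[0] + word[2:]
            break
    return " ".join(words)


# Each entry is one small, meaning-preserving edit.
PERTURBATIONS = {
    "upper_case": str.upper,
    "lower_case": str.lower,
    "extra_spaces": lambda t: t.replace(" ", "  "),
    "typo": swap_typo,
}


def robustness_rate(model, inputs, perturbations, within_tolerance) -> float:
    """Proportion of perturbed inputs whose output stays within tolerance of the original."""
    total = 0
    stable = 0
    for text in inputs:
        baseline = model(text)
        for perturb in perturbations.values():
            total += 1
            if within_tolerance(baseline, model(perturb(text))):
                stable += 1
    if total == 0:
        raise ValueError("no perturbed inputs were generated")
    return stable / total


def toy_model(prompt: str) -> str:
    # Hypothetical brittle model: it only recognises the phrase with single spaces.
    return "Paris" if "capital of france" in prompt.lower() else "I don't know"


rate = robustness_rate(
    toy_model,
    ["What is the capital of France?"],
    PERTURBATIONS,
    within_tolerance=lambda a, b: a == b,  # simplification: exact match as the tolerance check
)
print(rate)  # 0.75: only the extra-spaces variant changes the answer
```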
Designing the campaign
Good practice is to start from real or representative source inputs, apply one transformation at a time so a failure is diagnosable, automate generation so hundreds of variants are cheap, and log every failing pair as a new regression case. Both techniques scale precisely because they need no hand-labelled oracle — the relation, not the answer, is the specification. This makes them the workhorse of GenAI robustness testing and a direct complement to the golden-dataset regression baseline from the previous section.
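One way to automate such a campaign is sketched below: every follow-up input is produced by exactly one named transformation, and each failing pair is captured as a record that can be appended to a regression file. The function names, record fields, and JSON Lines format are assumptions chosen for illustration, and `toy_model` is a hypothetical case-sensitive stand-in for the real system.

```python
import json


def run_campaign(model, source_inputs, transformations, relation_holds):
    """Apply one transformation at a time and collect every failing pair."""
    failures = []
    for source in source_inputs:
        source_output = model(source)
        for name, transform in transformations.items():
            follow_up = transform(source)  # a single transformation, so a failure is attributable
            follow_up_output = model(follow_up)
            if not relation_holds(source_output, follow_up_output):
                failures.append({
                    "transformation": name,
                    "source_input": source,
                    "follow_up_input": follow_up,
                    "source_output": source_output,
                    "follow_up_output": follow_up_output,
                })
    return failures


def to_jsonl(records) -> str:
    """Serialise failing pairs as JSON Lines, one regression case per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in that is (wrongly) sensitive to casing.
    return "Paris" if "France" in prompt else "unknown"


transformations = {
    "upper_case": str.upper,
    "irrelevant_context": lambda p: "It rained yesterday. " + p,
}

failures = run_campaign(
    toy_model,
    ["What is the capital of France?"],
    transformations,
    relation_holds=lambda a, b: a == b,  # simplification: exact match
)
print(to_jsonl(failures))
```

Because each record names its transformation, a failure points straight at the perturbation that caused it rather than at a combination of edits.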
Choosing thresholds and avoiding false alarms
The hard part is calibrating the tolerance. Set it too tight and legitimate paraphrasing trips the test (a false positive); set it too loose and a genuine meaning change slips through (a false negative). Tune the similarity threshold empirically against a small labelled sample of known-good and known-bad output pairs, then hold it stable so results stay comparable across runs. Watch for transformations that should change the answer but that your relation wrongly treats as invariant: an "irrelevant" sentence that is actually relevant, or a synonym that shifts nuance. When a metamorphic or perturbation test fails, first confirm that the relation is valid for that pair (the transformation really did preserve, or really did flip, the meaning) before filing a defect, so that you triage faulty relations and flaky oracles separately from real robustness bugs. Track failures by transformation type to see which perturbation class the model is weakest against.
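Threshold tuning against a labelled sample can be sketched as a simple sweep. The scores and labels below are invented purely for illustration, and the `calibrate` helper and candidate thresholds are hypothetical; a real calibration would use similarity scores from your own oracle on pairs a reviewer has labelled.

```python
def calibrate(scored_pairs, candidates):
    """Pick the candidate threshold with the fewest total errors on a labelled sample.

    scored_pairs: list of (similarity_score, is_equivalent) tuples.
    Returns (threshold, total_errors, false_positives, false_negatives).
    """
    best = None
    for threshold in candidates:
        # False positive: an equivalent pair scores below the threshold, so the test trips wrongly.
        false_positives = sum(1 for score, ok in scored_pairs if ok and score < threshold)
        # False negative: a non-equivalent pair scores at or above the threshold and slips through.
        false_negatives = sum(1 for score, ok in scored_pairs if not ok and score >= threshold)
        errors = false_positives + false_negatives
        if best is None or errors < best[1]:
            best = (threshold, errors, false_positives, false_negatives)
    return best


# Invented illustrative sample: (similarity score, pair labelled as equivalent?)
labelled_sample = [
    (0.95, True), (0.88, True), (0.82, True),
    (0.78, False), (0.70, False), (0.55, False),
]

best = calibrate(labelled_sample, candidates=[0.6, 0.7, 0.8, 0.9])
print(best)  # (0.8, 0, 0, 0) on this sample
```

On ties the sweep keeps the first (lowest) candidate, which is a design choice; a team that fears missed defects more than false alarms could prefer the highest tied threshold instead.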