A tester asks 'What is the capital of France?' then 'Which city is France's capital?' and requires both answers to mean the same thing. Which technique is this?

A metamorphic paraphrase-invariance relation. The prompt is reworded with the same intent, and the two outputs must be semantically equal. The check needs no golden answer, only the expected relation between the outputs.

Inserting the neutral sentence 'It rained yesterday.' before a factual question and requiring the factual answer to stay unchanged tests which property?

Irrelevant-context invariance, that is, robustness to distraction. Adding a neutral, unrelated sentence should not change a robust system's factual answer; a changed answer reveals susceptibility to distraction.

What does a perturbation test's 'robustness rate' measure?

The proportion of small, meaning-preserving input edits (typos, synonyms, casing changes) whose output stays within tolerance of the original answer. A model that flips on a single typo has a low robustness rate.

Perturbation & metamorphic testing — Free Study Guide 2026

Key Takeaways

Metamorphic and perturbation testing sidestep the oracle problem by checking relationships between related inputs' outputs, not one output against a known answer.
A metamorphic relation links a source input/output to a transformed follow-up; the outputs must satisfy an expected relation (invariance or a directional change).
Common invariance relations: paraphrasing the prompt, adding irrelevant context, or reordering list items should not change a factual answer or decision.
Perturbation testing applies small meaning-preserving edits (typos, synonyms, casing) and asserts output stability, measured as a robustness rate.
Both techniques need no hand-labelled oracle and pair with the golden dataset; apply one transformation at a time and log every failing pair as a regression case.

The oracle problem and how to sidestep it

Section 3.1 assumed you already had a reference answer to compare against. Often you do not: nobody has hand-labelled the "correct" output for every possible input, and for open-ended generation a single ground truth may not even exist. This is the oracle problem. Metamorphic and perturbation testing sidestep it by checking the relationship between outputs across related inputs. You do not need to know the right answer; you only need to know how the answer should change, or stay the same, when you change the input in a defined way.

Metamorphic testing

A metamorphic relation (MR) is a rule that links a source input/output pair to a follow-up input/output pair. You transform the source input in a controlled way, run the model again, and assert that the two outputs satisfy the expected relation. If they do not, you have found a defect — without ever writing a golden answer.

Common relations for GenAI fall into two families:

Invariance relations — the transformation should not change the meaning of the output.
Directional / equivalence relations — the transformation should change the output in a predictable direction, or preserve a specific property.

Worked examples make this concrete:

Paraphrase invariance. Source: "What is the capital of France?" Follow-up: "Which city is France's capital?" The factual answer must stay semantically equal ("Paris"). If the paraphrase flips the answer, the model is unstable to surface wording.
Irrelevant-context invariance. Insert a neutral, unrelated sentence ("It rained yesterday.") before a factual question. A robust system's factual answer should not change; a changed answer reveals susceptibility to distraction.
Negation / antonym relation. Change "list the advantages of X" to "list the disadvantages of X"; the follow-up output should be substantively different, not a copy. Sameness signals the model ignored the semantic flip.
Translation round-trip. Translate English to German to English; the meaning should be preserved, so large drift flags a translation defect.
Ordering invariance. For "summarise these three reviews," reordering the reviews should not materially change the summary's sentiment or key points.
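The worked examples above can be sketched as a small harness. This is a minimal illustration, not a reference implementation: `MetamorphicRelation`, `holds`, and `toy_model` are hypothetical names, the toy model stands in for a real GenAI system, and the exact, case-insensitive comparison is only a placeholder for a semantic, tolerance-based check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[str], str]       # builds the follow-up input from the source input
    relation: Callable[[str, str], bool]  # expected relation between the two outputs


def same_answer(source_out: str, follow_up_out: str) -> bool:
    # Placeholder comparison (assumption): real suites compare meaning, not strings.
    return source_out.strip().lower() == follow_up_out.strip().lower()


def different_answer(source_out: str, follow_up_out: str) -> bool:
    return not same_answer(source_out, follow_up_out)


def holds(mr: MetamorphicRelation, model: Callable[[str], str], source_input: str) -> bool:
    """Run the model on the source and follow-up inputs and check the relation."""
    source_out = model(source_input)
    follow_up_out = model(mr.transform(source_input))
    return mr.relation(source_out, follow_up_out)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for the system under test.
    text = prompt.lower()
    if "capital" in text and "france" in text:
        return "Paris"
    if "disadvantages" in text:  # checked first: "disadvantages" contains "advantages"
        return "Higher cost."
    if "advantages" in text:
        return "Faster delivery."
    return "I don't know."


IRRELEVANT_CONTEXT = MetamorphicRelation(
    "irrelevant-context invariance",
    lambda p: "It rained yesterday. " + p,
    same_answer,
)
NEGATION = MetamorphicRelation(
    "negation",
    lambda p: p.replace("advantages", "disadvantages"),
    different_answer,
)

print(holds(IRRELEVANT_CONTEXT, toy_model, "What is the capital of France?"))  # True
print(holds(NEGATION, toy_model, "List the advantages of X."))                 # True
```

Neither check needs a golden answer: the invariance relation only asks that the two outputs agree, and the negation relation only asks that they differ.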

The metamorphic-relations table below is the kind of artefact you build during test design:

Metamorphic relation	Input transformation	Expected relation on output
Paraphrase invariance	Reword the prompt, same intent	Output meaning unchanged
Irrelevant-context invariance	Insert a neutral sentence	Factual answer unchanged
Negation	Flip the ask (advantages to disadvantages)	Output substantively different
Case / format invariance	Change casing or spacing	Answer unchanged
Translation round-trip	Language A to B to A	Meaning preserved
Ordering invariance	Reorder list items	Summary or decision stays stable

Because outputs are non-deterministic, the assertion on the relation is itself tolerance-based: "semantically equal" means similarity above a threshold, not string identity, reusing the semantic oracles from Section 3.1.
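A tolerance-based assertion of this kind might look like the sketch below. The token-overlap (Jaccard) score is an assumption chosen only because it runs without external libraries; it is a crude lexical proxy, and a real suite would plug in whatever semantic similarity measure it already uses. The names `token_jaccard` and `semantically_equal` and the 0.8 threshold are illustrative.

```python
import re
from typing import Callable


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct lower-cased word tokens that the two texts have in common."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def semantically_equal(
    a: str,
    b: str,
    threshold: float = 0.8,  # illustrative value; calibrate it on labelled pairs
    similarity: Callable[[str, str], float] = token_jaccard,
) -> bool:
    """Tolerance-based equality: similarity at or above the threshold, not string identity."""
    return similarity(a, b) >= threshold


source_out = "The capital of France is Paris."
follow_up_out = "Paris is the capital of France."

print(source_out == follow_up_out)                     # False: the strings differ
print(semantically_equal(source_out, follow_up_out))   # True: same tokens, reordered
print(semantically_equal(source_out, "The capital of France is Lyon."))  # False: 5/7 < 0.8
```

The last call shows why the threshold matters: the wrong answer still shares five of seven tokens with the right one, so a looser threshold would let it pass.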

Perturbation testing

Perturbation testing applies small, meaning-preserving changes to an input and checks that behaviour stays stable — it is the robustness lens on the same idea. Where a metamorphic relation defines an expected output relationship, a perturbation test typically asserts stability: a tiny edit should not cause a large output swing.

Typical perturbations include:

Typos and misspellings — "recieve", "teh" — the answer should be unaffected.
Synonym substitution — "big" to "large", "buy" to "purchase".
Whitespace / punctuation — extra spaces or missing commas.
Word or clause reordering where meaning is preserved.
Case changes — ALL CAPS or lower case.
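The perturbations listed above are cheap to generate in code. The sketch below is one possible set of generators using only the standard library; the function names are hypothetical, and the typo generator simply swaps two adjacent characters, which is only one of many ways to simulate a misspelling.

```python
import re
from typing import Dict


def swap_adjacent(text: str, index: int) -> str:
    """Typo: swap the characters at index and index + 1 (no change if out of range)."""
    if index < 0 or index + 1 >= len(text):
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def substitute_synonym(text: str, word: str, synonym: str) -> str:
    """Replace whole-word occurrences of word with synonym."""
    return re.sub(rf"\b{re.escape(word)}\b", lambda _match: synonym, text)


def add_extra_spaces(text: str) -> str:
    """Whitespace perturbation: double every single space."""
    return text.replace(" ", "  ")


def perturbations(text: str) -> Dict[str, str]:
    """Build a named family of variants, one perturbation per variant."""
    return {
        "typo": swap_adjacent(text, 1),
        "extra_spaces": add_extra_spaces(text),
        "upper_case": text.upper(),
        "lower_case": text.lower(),
    }


print(swap_adjacent("receive", 3))                            # recieve
print(swap_adjacent("the", 1))                                # teh
print(substitute_synonym("buy a big house", "big", "large"))  # buy a large house
```

Keeping each variant under its own name means a later failure can be traced back to the perturbation class that caused it.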

You measure a robustness rate: the proportion of perturbed inputs whose output stays within tolerance of the original. A model that answers correctly on clean text but flips on a single typo has a robustness defect — important because real users type messily. Perturbation suites pair naturally with the golden dataset: take each golden input, generate a family of perturbations, and require every output to remain within the acceptable band.
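As a rough sketch, the robustness rate for one source input can be computed as follows. `robustness_rate` and `brittle_model` are hypothetical names, the toy model stands in for the system under test, and exact string equality is used here only as a placeholder for the tolerance check.

```python
from typing import Callable, Iterable


def robustness_rate(
    model: Callable[[str], str],
    original_input: str,
    perturbed_inputs: Iterable[str],
    within_tolerance: Callable[[str, str], bool],
) -> float:
    """Proportion of perturbed inputs whose output stays within tolerance of the original."""
    baseline = model(original_input)
    perturbed = list(perturbed_inputs)
    if not perturbed:
        raise ValueError("at least one perturbed input is required")
    stable = sum(1 for p in perturbed if within_tolerance(baseline, model(p)))
    return stable / len(perturbed)


def brittle_model(prompt: str) -> str:
    # Hypothetical model that only recognises the correctly spelled keyword.
    return "Paris" if "capital" in prompt.lower() else "Unknown"


variants = [
    "what is the capital of france?",   # lower case
    "WHAT IS THE CAPITAL OF FRANCE?",   # upper case
    "What is the captial of France?",   # typo
    "What  is the capital of France?",  # extra space
]

rate = robustness_rate(
    brittle_model,
    "What is the capital of France?",
    variants,
    within_tolerance=lambda a, b: a == b,  # placeholder for a semantic comparison
)
print(rate)  # 0.75: three of the four variants keep the answer, the typo flips it
```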

Designing the campaign

Good practice is to start from real or representative source inputs, apply one transformation at a time so a failure is diagnosable, automate generation so hundreds of variants are cheap, and log every failing pair as a new regression case. Both techniques scale precisely because they need no hand-labelled oracle — the relation, not the answer, is the specification. This makes them the workhorse of GenAI robustness testing and a direct complement to the golden-dataset regression baseline from the previous section.
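A campaign loop along these lines could look like the sketch below, which covers invariance-style transformations only. The names `run_campaign` and `case_sensitive_model`, the record fields, and the use of JSON for the regression log are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_campaign(
    model: Callable[[str], str],
    source_inputs: Iterable[str],
    transformations: Dict[str, Callable[[str], str]],
    outputs_agree: Callable[[str, str], bool],
) -> List[dict]:
    """Apply one transformation at a time and collect every failing pair."""
    failures = []
    for source in source_inputs:
        source_output = model(source)
        for name, transform in transformations.items():
            follow_up = transform(source)  # exactly one transformation per variant
            follow_up_output = model(follow_up)
            if not outputs_agree(source_output, follow_up_output):
                failures.append({
                    "transformation": name,
                    "source_input": source,
                    "source_output": source_output,
                    "follow_up_input": follow_up,
                    "follow_up_output": follow_up_output,
                })
    return failures


def case_sensitive_model(prompt: str) -> str:
    # Hypothetical system under test that breaks when casing changes.
    return "Paris" if "France" in prompt else "Unknown"


TRANSFORMATIONS = {
    "upper_case": str.upper,
    "irrelevant_context": lambda p: "It rained yesterday. " + p,
}

failures = run_campaign(
    case_sensitive_model,
    ["What is the capital of France?"],
    TRANSFORMATIONS,
    outputs_agree=lambda a, b: a == b,  # placeholder for a tolerance-based check
)

# Each failing pair becomes a candidate regression case.
regression_cases = json.dumps(failures, indent=2)
print(regression_cases)
```

Because each variant carries a single named transformation, the one failure recorded here points directly at casing rather than at the added context.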

Choosing thresholds and avoiding false alarms

The hard part is calibrating the tolerance. Set it too tight and legitimate paraphrasing trips the test (a false positive); set it too loose and a genuine meaning change slips through (a false negative). Tune the similarity threshold empirically against a small labelled sample of known-good and known-bad output pairs, then hold it stable so results stay comparable across runs. Watch for transformations that your relation marks as invariant but that should in fact change the answer, such as an "irrelevant" sentence that is actually relevant or a synonym that shifts nuance. When a metamorphic or perturbation test fails, first confirm that the relation is still valid for that input (that is, the transformation really did preserve the meaning) before filing a defect, so that flaky oracles are triaged separately from real robustness bugs. Track invariance failures by transformation type to see which perturbation class the model is weakest against.
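One simple way to calibrate, sketched under stated assumptions: score each labelled output pair with the similarity measure, then pick the candidate threshold that produces the fewest false alarms plus missed defects. The scores and labels below are invented for illustration, and `count_errors`, `pick_threshold`, and `failures_by_type` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, List, Tuple

ScoredPair = Tuple[float, bool]  # (similarity score, pair really is equivalent)


def count_errors(scored_pairs: List[ScoredPair], threshold: float) -> Tuple[int, int]:
    """Return (false alarms, misses) for one threshold.

    False alarm: an equivalent pair scores below the threshold and trips the test.
    Miss: a non-equivalent pair scores at or above the threshold and slips through.
    """
    false_alarms = sum(1 for score, equivalent in scored_pairs if equivalent and score < threshold)
    misses = sum(1 for score, equivalent in scored_pairs if not equivalent and score >= threshold)
    return false_alarms, misses


def pick_threshold(scored_pairs: List[ScoredPair], candidates: Iterable[float]) -> float:
    """Choose the candidate with the fewest total errors (lowest threshold wins ties)."""
    return min(candidates, key=lambda t: (sum(count_errors(scored_pairs, t)), t))


def failures_by_type(failures: Iterable[dict]) -> Counter:
    """Count failing pairs per transformation type."""
    return Counter(f["transformation"] for f in failures)


# Invented labelled sample: three known-good pairs and three known-bad pairs.
labelled = [(0.95, True), (0.88, True), (0.82, True), (0.70, False), (0.60, False), (0.91, False)]

best = pick_threshold(labelled, [0.5, 0.6, 0.7, 0.8, 0.9])
print(best, count_errors(labelled, best))  # 0.8 (0, 1)

tally = failures_by_type([
    {"transformation": "typo"},
    {"transformation": "typo"},
    {"transformation": "upper_case"},
])
print(tally.most_common(1))  # [('typo', 2)]
```

In this invented sample no threshold is error-free: the known-bad pair scoring 0.91 slips through at every candidate, which is the kind of case to review by hand rather than fix by tightening the threshold.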

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

3.2 Perturbation & metamorphic testing
