Why do exact-match assertions (assertEquals against one reference string) frequently produce false failures when testing a GenAI system?

Because a non-deterministic model can express the same correct answer with different wording, so a single rigid string is too strict an oracle. As the section explains, GenAI samples tokens probabilistically, so the same prompt can return different yet acceptable phrasings. An exact string oracle then reports a false failure whenever wording differs; the fix is tolerance-based assertions, not a stricter string match.

Which combination most improves run-to-run reproducibility of a GenAI regression test?

Temperature 0 (greedy decoding), a fixed seed where supported, and a pinned dated model version. These are the variance-reduction controls the section lists: greedy decoding at temperature 0, a fixed seed where the API supports it, and a pinned dated model snapshot. Doing the opposite of each increases variance, and a floating 'latest' alias is named as the most common cause of surprise failures.

What best describes a golden dataset in GenAI testing?

A curated set of input and reference-output pairs, scored in aggregate as a regression baseline. The section defines a golden dataset as curated pairs of inputs and reference outputs covering typical, edge, and past-failure cases, scored in aggregate (e.g. at least 95% pass) and re-run on every change as the regression baseline.

Controlling non-determinism — Free Study Guide 2026

Key Takeaways

GenAI is non-deterministic: the same prompt can yield different valid outputs, so exact-match assertions produce flaky failures that do not indicate real defects.
Reduce test variance with temperature 0 (greedy decoding), a fixed seed where supported, and a pinned dated model version instead of a 'latest' alias.
Replace exact-match oracles with tolerance-based checks: semantic-similarity thresholds, fact presence, schema validation, and LLM-as-judge scoring.
A golden dataset is a curated test set of input and reference-output pairs, scored in aggregate (e.g. at least 95% of cases pass) as the regression baseline.
Re-run and version the golden dataset on every prompt, model, parameter, or retrieval change; a drop in aggregate score signals a regression.

Why generative AI is non-deterministic

A conventional program is deterministic: the same input produces the same output every time, so a tester can assert assertEquals(expected, actual) with confidence. Large language models break this contract. A GenAI system samples its next token from a probability distribution, so the same prompt submitted twice can return two different, yet both acceptable, answers. "Summarise this paragraph" might yield a five-word summary on one run and a two-sentence summary on the next. Sources of variance include probabilistic sampling, floating-point non-associativity across GPUs, load-balanced routing to different hardware, and silent vendor model updates behind the same API endpoint.
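The sampling step described above can be illustrated with a toy example. This is a minimal sketch, not how any particular model is implemented: the three candidate tokens and their scores are invented for illustration, and a real model chooses among tens of thousands of tokens at every step.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, otherwise a weighted random draw."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidate tokens and scores, for illustration only.
TOKENS = ["short", "brief", "concise"]
LOGITS = [2.0, 1.8, 1.5]

rng = random.Random()  # unseeded, so the draws differ from run to run

# Collect the distinct tokens chosen over 200 draws in each mode.
sampled = {sample_token(TOKENS, LOGITS, 1.0, rng) for _ in range(200)}
greedy = {sample_token(TOKENS, LOGITS, 0, rng) for _ in range(200)}

print("temperature 1.0 produced:", sorted(sampled))
print("temperature 0 produced:", sorted(greedy))
```

At temperature 1.0 several tokens appear across the draws, while at temperature 0 only the top-scoring token is ever chosen. An exact-match assertion written against one of the sampled outputs would fail whenever a different, equally valid token is drawn.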

This non-determinism is exactly why classic expected-value assertions fail. An exact string comparison against a single reference answer reports a false failure whenever the model phrases a correct answer differently. The test is flaky not because the code is broken, but because the oracle is too strict. The tester's job therefore shifts from asking "is the output identical to the expected string?" to "is the output acceptable within a defined tolerance?"

Reducing variance so tests become repeatable

Before you can measure quality reliably you must shrink the noise. Several controls make a GenAI system as reproducible as possible:

Temperature 0 / greedy decoding. Temperature rescales the model's token probabilities before sampling: higher values flatten the distribution and increase randomness, lower values sharpen it. At temperature 0 the model always picks the highest-probability token (greedy decoding), which minimises run-to-run variation. Use it for regression tests where you want the most stable output. It does not guarantee bit-identical results across hardware, and it can mask output-diversity issues you may want to test separately.
Fixed seed where supported. Some APIs accept a random-seed parameter; pinning it makes sampling reproducible for a given backend. Treat it as best-effort, because providers often document the seed as non-guaranteed.
Pinned model version. Always test against an explicit, dated model snapshot (for example model-2026-05) rather than a floating "latest" alias. A silent upgrade is the single most common cause of a suite that passed yesterday and fails today with no code change.
Fixed prompt template, parameters, and context. Hold the system prompt, max-tokens, top-p, and any retrieved context constant so that the only variable under test is the one you intend to exercise.
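One way to keep the controls above from drifting is to hold them in a single immutable configuration object and check it before a regression run. The sketch below is illustrative only: `GenerationConfig`, `reproducibility_problems`, the default values, and the snapshot naming pattern are all assumptions, not part of any vendor API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a dated snapshot name ends in -YYYY-MM or -YYYY-MM-DD, like the
# "model-2026-05" example above. Real providers use their own naming schemes.
DATED_SNAPSHOT = re.compile(r"-\d{4}-\d{2}(-\d{2})?$")


@dataclass(frozen=True)  # frozen: the settings cannot be changed mid-run
class GenerationConfig:
    """Hypothetical container for every setting a regression run holds constant."""
    model: str
    temperature: float = 0.0
    seed: Optional[int] = 1234
    top_p: float = 1.0
    max_tokens: int = 256
    system_prompt: str = "You are a concise assistant."


def reproducibility_problems(config):
    """Return a list of settings that would make a regression run less repeatable."""
    problems = []
    if not DATED_SNAPSHOT.search(config.model):
        problems.append(f"model '{config.model}' is not a dated snapshot")
    if config.temperature != 0:
        problems.append(f"temperature is {config.temperature}, expected 0")
    if config.seed is None:
        problems.append("no seed set (best-effort, and only where the API supports one)")
    return problems


pinned = GenerationConfig(model="model-2026-05")
floating = GenerationConfig(model="model-latest", temperature=0.7, seed=None)

print(reproducibility_problems(pinned))
print(reproducibility_problems(floating))
```

Running the check at the start of a test session turns a silent configuration drift into an explicit, readable failure before any model call is made.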

From exact-match to tolerance-based and semantic assertions

Even with variance controlled, you rarely get one canonical string, so the oracle itself changes shape. Instead of exact match, GenAI assertions accept a band of correct behaviour:

Semantic similarity — embed the output and the reference, and pass if cosine similarity exceeds a threshold.
Keyword / fact presence — assert that required facts appear, regardless of wording.
Structural / schema checks — for JSON output, validate the schema deterministically even when field values vary.
LLM-as-judge — a separate model scores the output against a rubric (correctness, relevance, tone) and returns a pass/fail verdict or a graded score.
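Three of the checks listed above can be sketched with the standard library alone. Treat this as a minimal illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.6 threshold is arbitrary, and all function names are hypothetical.

```python
import json
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return Counter(cleaned.split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_semantically_close(output, reference, threshold=0.6):
    """Pass when similarity to the reference reaches the (arbitrary) threshold."""
    return cosine_similarity(toy_embed(output), toy_embed(reference)) >= threshold


def has_required_facts(output, facts):
    """Pass when every required fact appears, regardless of surrounding wording."""
    lowered = output.lower()
    return all(fact.lower() in lowered for fact in facts)


def matches_schema(raw_json, required_fields):
    """Pass when the output parses as a JSON object with the expected field types."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(
        isinstance(data.get(name), expected) for name, expected in required_fields.items()
    )


reference = "The refund is issued within 14 days of the return."
output = "You will get the refund within 14 days of the return."

print("exact match:", output == reference)
print("semantic check:", is_semantically_close(output, reference))
print("fact check:", has_required_facts(output, ["refund", "14 days"]))
print("schema check:", matches_schema('{"answer": "ok", "confidence": 0.9}',
                                      {"answer": str, "confidence": float}))
```

The reworded answer fails the exact comparison but passes the tolerant checks. In practice the threshold needs calibrating against examples of acceptable and unacceptable outputs, and word-overlap similarity is far cruder than a learned embedding.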

The table below contrasts the two worlds a tester must bridge.

| Aspect | Deterministic software | GenAI system |
| --- | --- | --- |
| Oracle | Single expected value | Range of acceptable outputs |
| Assertion | `assertEquals(exact)` | Similarity threshold, fact-presence, LLM-as-judge |
| Repeatability | Guaranteed | Best-effort (temp 0, seed, pinned version) |
| A failing test means | Defect in code | Flaky oracle, drift, or a real defect |
| Pass criterion | Binary exact match | Tolerance band or graded score |

Golden datasets as the regression backbone

A golden dataset is a curated collection of representative input → reference-output pairs that acts as the system's regression baseline. Building one is itself a test-design activity: cover typical cases, edge cases, and known past failures, and label each with an acceptable answer or a scoring rubric rather than a single rigid string.

To run it, feed every input through the pinned model at temperature 0, then score each output — by semantic similarity, fact-presence rules, or LLM-as-judge — and aggregate the scores into a suite-level pass rate. Because individual outputs vary, you track an aggregate metric (for example "at least 95% of golden cases score 0.8 or higher") rather than demanding every single case be perfect. Re-run the golden dataset whenever the prompt, model version, retrieval source, or parameters change; a drop in the aggregate score signals a regression. Version the dataset alongside the code, and expand it every time production surfaces a new failure, so the golden set grows into an ever-tightening safety net.
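The run-and-aggregate loop described above might look like the following sketch. Everything named here is hypothetical: `GoldenCase`, the fact-based scorer, and the `generate` callable, which stands in for a call to the pinned model at temperature 0. The two-case dataset exists only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden entry: an input plus the facts an acceptable answer must contain."""
    case_id: str
    prompt: str
    required_facts: tuple


def score_case(output, case):
    """Score from 0.0 to 1.0: the fraction of required facts present in the output."""
    lowered = output.lower()
    hits = sum(1 for fact in case.required_facts if fact.lower() in lowered)
    return hits / len(case.required_facts)


def run_golden_suite(cases, generate, case_threshold=0.8, suite_threshold=0.95):
    """Score every case, then judge the suite on its aggregate pass rate."""
    scores = {case.case_id: score_case(generate(case.prompt), case) for case in cases}
    passed = sum(1 for score in scores.values() if score >= case_threshold)
    pass_rate = passed / len(cases)
    return {
        "scores": scores,
        "pass_rate": pass_rate,
        "suite_passed": pass_rate >= suite_threshold,
    }


GOLDEN = [
    GoldenCase("refund-window", "How long do refunds take?", ("14 days",)),
    GoldenCase("capital", "What is the capital of France?", ("Paris",)),
]

# Canned answers stand in for real model output; one of them is wrong on purpose.
CANNED = {
    "How long do refunds take?": "Refunds are issued within 14 days.",
    "What is the capital of France?": "The capital is Lyon.",
}

report = run_golden_suite(GOLDEN, CANNED.get)
print(report)
```

Storing the per-case scores alongside the aggregate makes it easier to see which golden cases caused a drop when the suite-level pass rate falls after a prompt or model change.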

Sampling multiple runs

Because a single run is only one sample, not a verdict, a robust GenAI test often runs the same input N times and asserts on the distribution of outcomes rather than on one result. You might require that at least 8 of 10 runs pass the semantic check, or that the worst-case run still avoids a critical error, or that the pass rate stays above a threshold. This statistical approach exposes intermittent defects that a single temperature-0 pass would hide, and it lets you test deliberately at higher temperatures where output diversity is itself a desired feature (for example brainstorming or creative writing). The trade-off is cost and latency, so reserve multi-run sampling for high-risk or non-deterministic-by-design cases, and keep cheap temperature-0 single runs for broad regression coverage. Report results as rates, never as a single pass/fail, so stakeholders read reliability honestly.
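The "at least 8 of 10 runs" rule can be sketched as follows. The helper names are hypothetical, and `fake_generate` is a stand-in that cycles through canned answers instead of calling a real model at a non-zero temperature.

```python
from itertools import cycle


def sample_runs(generate, check, prompt, runs=10):
    """Run the same prompt `runs` times and record whether each output passes."""
    return [bool(check(generate(prompt))) for _ in range(runs)]


def summarise(outcomes, min_passes):
    """Report the result as a rate, plus whether it meets the agreed minimum."""
    passes = sum(outcomes)
    return {
        "runs": len(outcomes),
        "passes": passes,
        "pass_rate": passes / len(outcomes),
        "meets_threshold": passes >= min_passes,
    }


# Stand-in for a sampling model: four acceptable phrasings and one wrong answer.
_canned = cycle([
    "Paris",
    "The capital is Paris.",
    "Paris, France",
    "It is Paris",
    "Lyon",
])


def fake_generate(prompt):
    """Return the next canned answer, ignoring the prompt."""
    return next(_canned)


def mentions_paris(output):
    """Fact-presence check used as the per-run oracle."""
    return "paris" in output.lower()


outcomes = sample_runs(fake_generate, mentions_paris, "What is the capital of France?")
report = summarise(outcomes, min_passes=8)
print(report)
```

Here the intermittent wrong answer shows up as a pass rate below 100% while the suite still meets its agreed threshold, which is the kind of result a single run could not express.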

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

3.1 Controlling non-determinism (seeds, temperature, golden datasets)
