3.1 Controlling non-determinism (seeds, temperature, golden datasets)

Key Takeaways

  • GenAI is non-deterministic: the same prompt can yield different valid outputs, so exact-match assertions produce flaky failures rather than real defects.
  • Reduce test variance with temperature 0 (greedy decoding), a fixed seed where supported, and a pinned dated model version instead of a 'latest' alias.
  • Replace exact-match oracles with tolerance-based checks: semantic-similarity thresholds, fact presence, schema validation, and LLM-as-judge scoring.
  • A golden dataset is a curated test set of input → reference-output pairs, scored in aggregate (e.g. at least 95% of cases pass) and used as the regression baseline.
  • Re-run and version the golden dataset on every prompt, model, parameter, or retrieval change; a drop in aggregate score signals a regression.

Last updated: July 2026

Why generative AI is non-deterministic

A conventional program is deterministic: the same input produces the same output every time, so a tester can write assertEquals(expected, actual) with confidence. Large language models break this contract. A GenAI system samples its next token from a probability distribution, so the same prompt submitted twice can return two different, yet both acceptable, answers. "Summarise this paragraph" might yield a five-word summary on one run and a two-sentence summary on the next. Sources of variance include probabilistic sampling, floating-point non-associativity across GPUs, load-balanced routing to different hardware, and silent vendor model updates behind the same API endpoint.

This non-determinism is exactly why classic expected-value assertions fail. An exact string comparison against a single reference answer reports a false failure whenever the model phrases a correct answer differently. The test is flaky not because the code is broken, but because the oracle is too strict. The tester's job therefore shifts from asking "is the output identical to the expected string?" to "is the output acceptable within a defined tolerance?"

Reducing variance so tests become repeatable

Before you can measure quality reliably you must shrink the noise. Several controls make a GenAI system as reproducible as possible:

  • Temperature 0 / greedy decoding. Temperature scales the sampling randomness. At temperature 0 the model always picks the highest-probability token (greedy decoding), which minimises run-to-run variation. Use it for regression tests where you want the most stable output. It does not guarantee bit-identical results across hardware, and it can mask output-diversity issues you may want to test separately.
  • Fixed seed where supported. Some APIs accept a random-seed parameter; pinning it makes sampling reproducible for a given backend. Treat it as best-effort, because providers often document the seed as non-guaranteed.
  • Pinned model version. Always test against an explicit, dated model snapshot (for example model-2026-05) rather than a floating "latest" alias. A silent upgrade behind an alias is a common cause of a suite that passed yesterday and fails today with no code change.
  • Fixed prompt template, parameters, and context. Hold the system prompt, max-tokens, top-p, and any retrieved context constant so that the only variable under test is the one you intend to exercise.
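
The controls listed above can be collected into one frozen configuration object, so that a test run cannot drift without the change being visible. The following is a minimal sketch, not a provider-specific implementation: the class name GenerationConfig, the helper names, and all default values are assumptions made for illustration, and the model name is the example snapshot used in the text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical bundle of everything held constant between regression runs."""
    model: str = "model-2026-05"   # dated snapshot, not a floating "latest" alias
    temperature: float = 0.0       # greedy decoding
    top_p: float = 1.0
    max_tokens: int = 256
    seed: int = 1234               # best-effort only; often documented as non-guaranteed
    system_prompt: str = "You are a concise assistant."


def config_fingerprint(config: GenerationConfig) -> str:
    """Short stable hash of the config, suitable for storing next to test results."""
    canonical = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def assert_pinned(config: GenerationConfig) -> None:
    """Fail fast if a regression run would use a floating alias or sampling."""
    if "latest" in config.model:
        raise ValueError(f"model must be a dated snapshot, got {config.model!r}")
    if config.temperature != 0.0:
        raise ValueError("regression tests expect temperature 0")


baseline = GenerationConfig()
assert_pinned(baseline)
print(config_fingerprint(baseline))
```

Recording the fingerprint with each test report makes it easy to tell whether two runs were made under the same settings before comparing their scores.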

From exact-match to tolerance-based and semantic assertions

Even with variance controlled, you rarely get one canonical string, so the oracle itself changes shape. Instead of exact match, GenAI assertions accept a band of correct behaviour:

  • Semantic similarity — embed the output and the reference, and pass if cosine similarity exceeds a threshold.
  • Keyword / fact presence — assert that required facts appear, regardless of wording.
  • Structural / schema checks — for JSON output, validate the schema deterministically even when field values vary.
  • LLM-as-judge — a separate model scores the output against a rubric (correctness, relevance, tone) and returns a pass/fail verdict or a graded score.
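
Three of the four oracle types above are deterministic and can be sketched with the standard library alone. The example below is illustrative only: the function names and the 0.8 threshold are assumptions, and the embedding vectors are passed in as plain lists because the embedding model itself is outside the scope of this sketch. LLM-as-judge is omitted since it depends on a specific provider API.

```python
import json
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantically_close(output_vec: Sequence[float],
                       reference_vec: Sequence[float],
                       threshold: float = 0.8) -> bool:
    """Semantic-similarity oracle: pass when similarity reaches the threshold."""
    return cosine_similarity(output_vec, reference_vec) >= threshold


def contains_facts(output: str, required_facts: List[str]) -> bool:
    """Fact-presence oracle: every required fact appears, regardless of wording around it."""
    lowered = output.lower()
    return all(fact.lower() in lowered for fact in required_facts)


def matches_schema(raw_json: str, required_fields: Dict[str, type]) -> bool:
    """Structural oracle: output parses as a JSON object with the expected field types."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(name), expected)
               for name, expected in required_fields.items())


print(contains_facts("Paris is the capital of France.", ["Paris", "France"]))
print(matches_schema('{"city": "Paris", "population": 2100000}',
                     {"city": str, "population": int}))
print(semantically_close([0.9, 0.1, 0.0], [1.0, 0.0, 0.0]))
```

Substring matching is a deliberately crude form of fact presence; a real suite may need normalisation or synonyms, which this sketch does not attempt.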

The table below contrasts the two worlds a tester must bridge.

Aspect               | Deterministic software | GenAI system
Oracle               | Single expected value  | Range of acceptable outputs
Assertion            | assertEquals(exact)    | Similarity threshold, fact-presence, LLM-as-judge
Repeatability        | Guaranteed             | Best-effort (temp 0, seed, pinned version)
A failing test means | Defect in code         | Flaky oracle, drift, or a real defect
Pass criterion       | Binary exact match     | Tolerance band or graded score

Golden datasets as the regression backbone

A golden dataset is a curated collection of representative input → reference-output pairs that acts as the system's regression baseline. Building one is itself a test-design activity: cover typical cases, edge cases, and known past failures, and label each with an acceptable answer or a scoring rubric rather than a single rigid string.

To run it, feed every input through the pinned model at temperature 0, then score each output — by semantic similarity, fact-presence rules, or LLM-as-judge — and aggregate the scores into a suite-level pass rate. Because individual outputs vary, you track an aggregate metric (for example "at least 95% of golden cases score 0.8 or higher") rather than demanding every single case be perfect. Re-run the golden dataset whenever the prompt, model version, retrieval source, or parameters change; a drop in the aggregate score signals a regression. Version the dataset alongside the code, and expand it every time production surfaces a new failure, so the golden set grows into an ever-tightening safety net.
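
The run-score-aggregate loop described above might look like the following sketch. It assumes a fact-presence rubric as the per-case scorer and takes the model as a plain callable; GoldenCase, fact_score, run_golden_suite, and the canned answers are all hypothetical names and data invented for the example, while the 0.8 and 0.95 thresholds mirror the figures used in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    required_facts: List[str]   # a rubric rather than one rigid reference string


def fact_score(output: str, required_facts: List[str]) -> float:
    """Fraction of required facts found in the output, from 0.0 to 1.0."""
    if not required_facts:
        return 1.0
    lowered = output.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in lowered)
    return hits / len(required_facts)


def run_golden_suite(cases: List[GoldenCase],
                     generate: Callable[[str], str],
                     case_threshold: float = 0.8,
                     suite_threshold: float = 0.95) -> Dict[str, object]:
    """Score every case, then judge the suite on its aggregate pass rate."""
    if not cases:
        raise ValueError("golden dataset is empty")
    scores = {c.case_id: fact_score(generate(c.prompt), c.required_facts) for c in cases}
    failed = sorted(cid for cid, score in scores.items() if score < case_threshold)
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return {
        "pass_rate": pass_rate,
        "regression": pass_rate < suite_threshold,
        "failed": failed,
    }


golden = [
    GoldenCase("capital-fr", "What is the capital of France?", ["Paris"]),
    GoldenCase("capital-jp", "What is the capital of Japan?", ["Tokyo"]),
]
# Stand-in for the pinned model at temperature 0: a lookup of canned answers.
canned = {
    "What is the capital of France?": "The capital is Paris.",
    "What is the capital of Japan?": "Tokyo is the capital.",
}
print(run_golden_suite(golden, lambda prompt: canned[prompt]))
```

Returning the list of failed case IDs alongside the rate keeps the aggregate view while still pointing a tester at the specific cases worth inspecting.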

Sampling multiple runs

Because a single run is only one sample, not a verdict, a robust GenAI test often runs the same input N times and asserts on the distribution of outcomes rather than on one result. You might require that at least 8 of 10 runs pass the semantic check, or that the worst-case run still avoids a critical error, or that the pass rate stays above a threshold. This statistical approach exposes intermittent defects that a single temperature-0 pass would hide, and it lets you test deliberately at higher temperatures where output diversity is itself a desired feature (for example brainstorming or creative writing). The trade-off is cost and latency, so reserve multi-run sampling for high-risk or non-deterministic-by-design cases, and keep cheap temperature-0 single runs for broad regression coverage. Report results as rates, never as a single pass/fail, so stakeholders read reliability honestly.
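
A multi-run check of this kind can be sketched as below. The helper names, the stub model, and the 0.8 minimum rate (8 of 10 runs) are assumptions for illustration; the stub cycles through canned outputs in place of a real sampled model so that the example stays self-contained.

```python
import itertools
from typing import Callable, Dict, List


def make_stub_model(outputs: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled model: cycles through canned outputs."""
    stream = itertools.cycle(outputs)
    return lambda prompt: next(stream)


def sample_runs(generate: Callable[[str], str],
                prompt: str,
                passes_check: Callable[[str], bool],
                is_critical: Callable[[str], bool],
                runs: int = 10) -> Dict[str, float]:
    """Run the same prompt N times and summarise the distribution of outcomes."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    outputs = [generate(prompt) for _ in range(runs)]
    passed = sum(1 for out in outputs if passes_check(out))
    critical = sum(1 for out in outputs if is_critical(out))
    return {"runs": runs, "pass_rate": passed / runs, "critical_failures": critical}


def acceptable(report: Dict[str, float], min_pass_rate: float = 0.8) -> bool:
    """Pass when the rate meets the minimum and no run hit a critical error."""
    return report["pass_rate"] >= min_pass_rate and report["critical_failures"] == 0


model = make_stub_model(["Paris", "Paris", "Paris", "Paris", "I am not sure"])
report = sample_runs(
    model,
    "What is the capital of France?",
    passes_check=lambda out: "paris" in out.lower(),
    is_critical=lambda out: "berlin" in out.lower(),
)
print(report, acceptable(report))
```

The report carries the rate and the critical-failure count rather than a single boolean, so the same data can feed both a gate in CI and a reliability figure for stakeholders.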