1.2 Foundation models, temperature/sampling & prompting basics

Key Takeaways

  • A foundation model is broadly pre-trained, and teams adapt it either by fine-tuning, which changes the weights, or by prompting, which leaves the weights unchanged.
  • Temperature 0 makes generation effectively greedy and the most deterministic, so functional tests should pin temperature low to maximise reproducibility.
  • Higher temperature and wider top-k or top-p increase randomness and diversity, which helps robustness testing but raises the risk of hallucination.
  • The max tokens setting caps output length, and setting it too low truncates answers in ways easily mistaken for a genuine model defect.
  • System prompts set persistent rules while user prompts carry the request, and few-shot prompting adds examples to improve output consistency and format.
Last updated: July 2026

Foundation models and how testers adapt them

Pre-training, foundation models and fine-tuning

A foundation model (or pre-trained model) is a large model trained on broad data to acquire general language ability; such models power most GenAI products under test. Because pre-training is expensive, teams rarely train from scratch. Instead they adapt a foundation model in one of two broad ways, and the choice changes what and how you test.

  • Fine-tuning continues training the model on a narrower, task-specific dataset, changing the model's weights. It can improve accuracy on specialized tasks but risks catastrophic forgetting of earlier skills, data leakage and overfitting, so regression testing across the original capabilities becomes essential.
  • Prompting (in-context learning) leaves the weights unchanged and steers the model at inference time with instructions and examples. It is cheaper and faster to iterate, but behavior is sensitive to wording, so tests must treat the prompt itself as a versioned, testable artifact.

Testers should know which approach a system uses because it determines where defects live: a fine-tuned model needs data-quality and regression checks, while a prompted system needs prompt-robustness and injection testing. A third pattern, retrieval-augmented generation, adds no training at all and instead supplies knowledge at inference time; it is covered in the next section. Many production systems combine these, so a tester should map exactly which knowledge is baked into weights, which is prompted, and which is retrieved before writing test cases.

Temperature and sampling: the reproducibility levers

When a model generates text it produces a probability distribution over the next token and then samples from it. These sampling settings are the tester's main levers for determinism and reproducibility.

  • Temperature rescales the probabilities. A temperature of 0 makes generation effectively greedy — it selects the highest-probability token every time, giving the most deterministic and repeatable output. Higher temperatures (for example 0.7 to 1.0 and above) flatten the distribution, increasing randomness, diversity and "creativity" — and the chance of hallucination.
  • Top-k sampling restricts the choice to the k most likely tokens before sampling from them.
  • Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability reaches p (for example 0.9), adapting the candidate pool to the model's confidence.
  • max tokens caps the length of the generated output; setting it too low causes truncated, incomplete answers — a defect often mistaken for a model failure.

For test design the guidance is clear. To make functional tests reproducible, set temperature to 0 (or the lowest available) and pin the model version and other sampling parameters. Note, however, that even at temperature 0 outputs are not guaranteed to be bit-identical across model updates or hardware, so assertions should target meaning and constraints, not exact strings. To test robustness, diversity, or the range of possible outputs, deliberately raise the temperature and run repeated trials to observe variance. A practical technique is to run the same case many times at a fixed temperature and measure how often the output satisfies the assertion; a flaky rate above your threshold is itself a reportable defect. Temperature, top-k and top-p are not independent knobs to change casually — each is part of the reproducible configuration that must be recorded alongside the model version for every test result.

Sampling parameters at a glance

ParameterWhat it controlsEffect on determinism
Temperature = 0Greedy selection of the most likely tokenMost deterministic; best for reproducible tests
Temperature highFlattens the probability distributionLess deterministic; more diverse and creative
Top-kLimits the pool to the k likeliest tokensSmaller k = more focused and repeatable
Top-p (nucleus)Smallest pool reaching cumulative probability pLower p = more constrained output
max tokensMaximum output lengthNo effect on randomness; a low value truncates

Prompting basics testers must control

A prompt is the input that conditions the model, and it is often the primary "code" under test. Testers need a working grasp of prompt structure.

  • System vs user prompts. The system prompt sets persistent role, rules and constraints; the user prompt carries the specific request. Many defects and prompt-injection vulnerabilities arise when user input overrides system instructions, so test whether system rules hold under adversarial user input.
  • Zero-shot prompting gives an instruction with no examples. Few-shot prompting includes a handful of input-output examples to demonstrate the desired format or behavior, usually improving consistency. The number and quality of those examples is itself a test variable, and poorly chosen examples can bias the model toward a narrow, unrepresentative pattern.
  • Prompt structure. A clear role, task, context, constraints and output format reduce ambiguity and make outputs easier to assert against. Requesting a structured format such as JSON makes automated verification far easier.

Because small wording changes shift behavior, prompts must be version-controlled and treated as testable artifacts. A robust GenAI test suite records the exact prompt, model version and sampling settings for every case so that results are traceable and defects are reproducible.

Test Your Knowledge

A tester needs functional test cases against an LLM to be as reproducible as possible. Which setting most directly supports that goal?

A
B
C
D
Test Your Knowledge

Which statement about fine-tuning versus prompting is most accurate for a tester planning coverage?

A
B
C
D