A tester wants to run the same set of prompts against three different models and two prompt templates, then compare the outputs side by side in a config-driven file that can run in CI. Which tool best matches this need?

Promptfoo, because it is declarative and runs matrix evaluations of prompts and models. As the section states, Promptfoo is a declarative, configuration-driven tool that runs a matrix evaluation showing each prompt and model combination side by side, and suits lightweight CI checks. RAGAS is RAG-specific, TruthfulQA is a benchmark dataset (5.3), and BERTScore is a metric (5.2), not an evaluation harness.

A retrieval-augmented generation (RAG) pipeline returns answers that are fluent but sometimes not supported by the retrieved documents. Which framework and metric most directly targets whether the answer is grounded in the retrieved context?

RAGAS using its faithfulness metric. The section defines RAGAS faithfulness as measuring whether the answer is grounded in the retrieved context, which is exactly the described problem. Exact-match, latency tracing, and regex assertions do not test grounding against retrieved passages.

Why do evaluation frameworks typically report aggregate pass rates or score distributions rather than a single pass/fail from one execution?

Because generative model outputs are non-deterministic, many cases must be aggregated to get a reliable signal. The section's key testing insight is that outputs are non-deterministic, so evaluation reports aggregate pass rates or score distributions across many cases. Frameworks do support deterministic assertions and can store individual results, so neither a lack of assertions nor a lack of per-result storage explains the practice.

Evaluation frameworks — Free Study Guide 2026

Why evaluation frameworks exist

Testing a generative AI system is unlike testing deterministic software. The same prompt can yield different outputs across runs, models, and configuration changes, so there is rarely a single "expected result" to assert against. An evaluation framework is the tooling layer that makes this tractable. It lets a tester define test cases (prompts plus expectations), run those prompts across one or more models or prompt variants, apply assertions or scoring to the outputs, and track how results change over time. In effect, these frameworks turn ad-hoc prompt experimentation into repeatable, version-controlled test suites that can run inside continuous integration (CI).

Core responsibilities

Regardless of vendor, an evaluation framework usually provides:

Test-case definition — a structured way to declare inputs (prompts, variables, datasets) and the expected behaviour or scoring criteria.
Matrix / multi-configuration execution — running the same cases across models, temperatures, or prompt templates to compare them side by side.
Assertions and scoring — deterministic checks (contains, regex, JSON schema) plus model-graded or metric-based scores.
Tracing and observability — capturing inputs, outputs, latency, token usage, and intermediate steps for debugging.
Regression tracking and CI integration — storing results so a change that degrades quality is caught before release.
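
The responsibilities above can be sketched in a few lines of Python. This is an illustrative toy harness, not the API of any tool named in this section: EvalCase, Result, model_a, model_b, run_suite, and regressions are all hypothetical names, and the two "models" are hard-coded stubs standing in for real model clients.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalCase:
    """Test-case definition: a prompt plus a deterministic expectation."""
    name: str
    prompt: str
    must_contain: str


@dataclass
class Result:
    """One traced execution: what ran, what came back, and how long it took."""
    case: str
    model: str
    output: str
    passed: bool
    latency_s: float


# Hypothetical stand-ins for real model clients.
def model_a(prompt: str) -> str:
    return "Paris is the capital of France."


def model_b(prompt: str) -> str:
    return "I am not sure."


def run_suite(cases: List[EvalCase], models: Dict[str, Callable[[str], str]]) -> List[Result]:
    """Run every case against every model and apply a 'contains' assertion."""
    results = []
    for case in cases:
        for model_name, call in models.items():
            start = time.perf_counter()
            output = call(case.prompt)
            latency = time.perf_counter() - start
            passed = case.must_contain.lower() in output.lower()
            results.append(Result(case.name, model_name, output, passed, latency))
    return results


def regressions(baseline: Dict[Tuple[str, str], bool], current: List[Result]) -> List[Tuple[str, str]]:
    """Return (case, model) pairs that passed in the stored baseline but fail now."""
    return [(r.case, r.model) for r in current
            if baseline.get((r.case, r.model)) and not r.passed]


CASES = [EvalCase("capital", "What is the capital of France?", "Paris")]
MODELS = {"model_a": model_a, "model_b": model_b}
RESULTS = run_suite(CASES, MODELS)

for r in RESULTS:
    print(r.case, r.model, "PASS" if r.passed else "FAIL")
```

In a CI job, a non-empty list from regressions would typically fail the build. A real framework adds persistence, richer assertions, and reporting on top of this same loop.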

A key testing insight: because outputs are non-deterministic, evaluation typically reports aggregate pass rates or score distributions across many cases rather than a single pass or fail on one run.
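
A minimal sketch of that idea, assuming a hypothetical flaky_model stub that answers correctly most of the time. The 0.8 success probability and the 0.7 gate threshold are invented for illustration, not recommended values.

```python
import random


def flaky_model(prompt: str, rng: random.Random) -> str:
    # Hypothetical model stub: returns the right answer roughly 80% of the time.
    return "4" if rng.random() < 0.8 else "5"


def pass_rate(prompt: str, expected: str, runs: int, seed: int = 0) -> float:
    """Execute the same case many times and report the fraction that passed."""
    rng = random.Random(seed)  # seeded so the sketch itself is repeatable
    outcomes = [flaky_model(prompt, rng) == expected for _ in range(runs)]
    return sum(outcomes) / runs


THRESHOLD = 0.7  # illustrative quality gate
rate = pass_rate("What is 2 + 2?", "4", runs=200)
print(f"pass rate: {rate:.2f}, gate: {'ok' if rate >= THRESHOLD else 'fail'}")
```

A single execution of this case would pass or fail more or less by chance. The aggregated rate compared against a threshold is the signal that a quality gate can act on.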

Evaluators: deterministic, statistical, and model-graded

Inside a framework, the component that turns an output into a score is called an evaluator (or grader), and three kinds recur. Deterministic evaluators apply exact rules — string equality, contains, regex, or JSON-schema validation — and are cheap and reproducible but only work when the correct answer is well defined. Statistical evaluators compute a metric such as n-gram overlap or embedding similarity against a reference (covered in 5.2). Model-graded evaluators use another LLM to judge the output against a rubric, which is flexible for open-ended tasks but slower and non-deterministic. A mature suite mixes all three, choosing the cheapest evaluator that still detects the failure you care about.
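
The three evaluator kinds can be illustrated as follows. This is a simplified sketch under stated assumptions: token_f1 is a crude word-overlap score standing in for the reference-based metrics of 5.2, and fake_judge is a hypothetical keyword stub standing in for a real LLM judge call.

```python
import re
from typing import Callable


def regex_evaluator(output: str, pattern: str) -> float:
    """Deterministic: 1.0 if the pattern is found in the output, else 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0


def token_f1(output: str, reference: str) -> float:
    """Statistical: F1 over the sets of unique lowercase word tokens."""
    out = set(re.findall(r"\w+", output.lower()))
    ref = set(re.findall(r"\w+", reference.lower()))
    overlap = len(out & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(out)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def model_graded(output: str, rubric: str, judge: Callable[[str], str]) -> float:
    """Model-graded: ask a judge model for a PASS or FAIL verdict against a rubric."""
    verdict = judge(f"Rubric: {rubric}\nAnswer: {output}\nReply PASS or FAIL.")
    return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0


def fake_judge(prompt: str) -> str:
    # Hypothetical judge: a real one would call an LLM. This stub only looks
    # for the word "refund" in the answer portion of the grading prompt.
    answer = prompt.split("Answer:", 1)[1]
    return "PASS" if "refund" in answer.lower() else "FAIL"


print(regex_evaluator("Order #1234 shipped", r"#\d{4}"))
print(token_f1("the cat", "the cat sat"))
print(model_graded("We will issue a refund.", "Mentions the refund policy", fake_judge))
```

The cost ordering in the paragraph above shows up directly: the first two functions are pure computation, while model_graded needs an extra model call for every output it scores.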

Good evaluation also depends on a curated dataset: a versioned set of representative inputs, ideally including known edge cases and past production incidents, with expected outputs or acceptance criteria. Because the dataset is the yardstick against which every model change is measured, it should be treated as a first-class, version-controlled test asset that grows as new failure modes are discovered.
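
One way to make the dataset traceable, sketched here as an assumption rather than a feature of any particular framework, is to store a content fingerprint alongside every evaluation run. The field names (id, input, expected_contains, source) and the dataset_fingerprint helper are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

# Illustrative dataset: one baseline case and one case added after an incident.
DATASET: List[Dict[str, str]] = [
    {"id": "greeting-001", "input": "Say hello in French.",
     "expected_contains": "Bonjour", "source": "baseline"},
    {"id": "incident-042", "input": "Summarise an empty document.",
     "expected_contains": "empty", "source": "production incident"},
]


def dataset_fingerprint(cases: List[Dict[str, str]]) -> str:
    """Short, stable hash of the dataset content, independent of case order."""
    ordered = sorted(cases, key=lambda c: c["id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


print("dataset version:", dataset_fingerprint(DATASET))
```

Recording this value with each run makes it visible when two score reports are not comparable because the yardstick itself changed between them.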

The main frameworks and what they focus on

The syllabus expects familiarity with several representative tools. They overlap but emphasise different concerns.

Framework	Primary focus	Typical use
Promptfoo	Declarative prompt testing and side-by-side matrix evaluation	Compare prompts/models via a config file, assert on outputs, run in CI
LangSmith	Tracing plus evaluation of LLM applications	Debug multi-step chains/agents, run dataset evals, monitor production
OpenAI Evals	A registry/framework for defining and sharing evals	Author reusable eval templates and run them against models
RAGAS	RAG-specific quality metrics	Score retrieval-augmented pipelines on faithfulness and relevancy

Promptfoo is a declarative, configuration-driven tool. A tester writes prompts, providers (models), and test cases with assertions in a config file, then runs a matrix evaluation that shows every prompt and model combination side by side. This suits prompt comparison and lightweight CI checks in which outputs are asserted to contain or match expected values, or are model-graded against expected criteria.
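
The matrix idea can be illustrated in plain Python. This sketch does not use Promptfoo's actual configuration format or command line: the CONFIG dictionary, the provider names, and fake_provider are all hypothetical, chosen only to show how two prompt templates and three models expand into six comparable rows.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical declarative config: prompts, providers, and tests with assertions.
CONFIG = {
    "prompts": {
        "terse": "Answer briefly: {question}",
        "polite": "Please answer the following question: {question}",
    },
    "providers": ["model-x", "model-y", "model-z"],
    "tests": [
        {"vars": {"question": "What is the capital of France?"}, "contains": "Paris"},
    ],
}


def fake_provider(name: str, prompt: str) -> str:
    # Hypothetical stub: a real provider would call a model API.
    return "Lyon" if name == "model-z" else "Paris"


def run_matrix(config: dict, call: Callable[[str, str], str] = fake_provider) -> List[Dict]:
    """Expand prompts x providers x tests and evaluate each combination."""
    rows = []
    for (prompt_id, template), provider, test in product(
        config["prompts"].items(), config["providers"], config["tests"]
    ):
        prompt = template.format(**test["vars"])
        output = call(provider, prompt)
        rows.append({
            "prompt": prompt_id,
            "provider": provider,
            "output": output,
            "pass": test["contains"] in output,
        })
    return rows


ROWS = run_matrix(CONFIG)
for row in ROWS:
    print(f"{row['prompt']:<7} {row['provider']:<8} {'PASS' if row['pass'] else 'FAIL'}")
```

Because the whole suite is driven by one data structure, adding a model or a prompt variant is a one-line change, which is what makes the declarative style convenient for CI.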

LangSmith centres on tracing and evaluation of LLM applications, especially multi-step chains and agents. It records the full execution trace so a tester can inspect intermediate calls, then run evaluations over curated datasets and monitor behaviour in production. Its strength is observability of complex pipelines, not only single-prompt scoring.
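
To show what a trace contains, here is a minimal hand-rolled tracer. It is not the LangSmith SDK: the Trace class and the retrieve, generate, and answer functions are hypothetical, and a real tool would also capture token usage and send spans to a backend.

```python
import time
from contextlib import contextmanager
from typing import List


class Trace:
    """Collects one record (span) per step of a multi-step pipeline."""

    def __init__(self) -> None:
        self.spans: List[dict] = []

    @contextmanager
    def span(self, name: str, **inputs):
        record = {"name": name, "inputs": inputs, "output": None}
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Latency is recorded even if the step raises.
            record["latency_s"] = time.perf_counter() - start
            self.spans.append(record)


# Hypothetical pipeline steps standing in for a retriever and a model call.
def retrieve(query: str) -> List[str]:
    return ["Paris is the capital of France."]


def generate(query: str, docs: List[str]) -> str:
    return f"Based on {len(docs)} document(s): Paris."


def answer(query: str, trace: Trace) -> str:
    with trace.span("retrieve", query=query) as s:
        docs = retrieve(query)
        s["output"] = docs
    with trace.span("generate", query=query, docs=docs) as s:
        text = generate(query, docs)
        s["output"] = text
    return text


trace = Trace()
final = answer("What is the capital of France?", trace)
for s in trace.spans:
    print(s["name"], "->", s["output"])
```

With the intermediate outputs recorded, a tester can tell whether a wrong final answer came from the retrieval step or from the generation step.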

OpenAI Evals provides a framework and registry pattern for defining evaluations as reusable, shareable specifications. Testers compose eval templates (for example, exact-match or model-graded evals) and run them against models, which encourages a library of standardised tests instead of one-off scripts.
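
The registry pattern itself is easy to sketch. The code below is a generic illustration, not the OpenAI Evals codebase or its file layout: EVAL_REGISTRY, register_eval, run_eval, and the sample keys input and ideal are assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping an eval name to its scoring function.
EVAL_REGISTRY: Dict[str, Callable[[str, str], float]] = {}


def register_eval(name: str):
    """Decorator that publishes a scoring function under a reusable name."""
    def decorator(fn: Callable[[str, str], float]):
        if name in EVAL_REGISTRY:
            raise ValueError(f"eval already registered: {name}")
        EVAL_REGISTRY[name] = fn
        return fn
    return decorator


@register_eval("exact-match")
def exact_match(output: str, ideal: str) -> float:
    return 1.0 if output.strip() == ideal.strip() else 0.0


@register_eval("includes")
def includes(output: str, ideal: str) -> float:
    return 1.0 if ideal in output else 0.0


def run_eval(name: str, samples: List[dict], model: Callable[[str], str]) -> float:
    """Look up an eval by name and return its mean score over the samples."""
    scorer = EVAL_REGISTRY[name]
    scores = [scorer(model(s["input"]), s["ideal"]) for s in samples]
    return sum(scores) / len(scores)


SAMPLES = [{"input": "2 + 2 = ?", "ideal": "4"}, {"input": "3 + 3 = ?", "ideal": "6"}]
print(run_eval("exact-match", SAMPLES, lambda prompt: "4"))
```

Because evals are looked up by name, the same template can be reused across datasets and models instead of being rewritten as a one-off script.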

RAGAS is specialised for retrieval-augmented generation. Rather than general text assertions it computes RAG-oriented metrics such as faithfulness (is the answer grounded in the retrieved context?), answer relevancy (does the answer address the question?), and context precision / context recall (did retrieval return the right supporting passages?). This lets testers separate retrieval failures from generation failures.
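
The intent of these metrics can be approximated with a deliberately naive sketch. This is not how RAGAS computes its scores and it does not use the RAGAS library: naive_faithfulness substitutes simple word overlap for a proper grounding check, retrieval_precision_recall assumes hand-labelled relevant document IDs, and the 0.8 threshold is arbitrary.

```python
import re
from typing import List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def naive_faithfulness(answer: str, contexts: List[str], threshold: float = 0.8) -> float:
    """Fraction of answer sentences whose words mostly appear in the retrieved context."""
    context_tokens = _tokens(" ".join(contexts))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and len(words & context_tokens) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


def retrieval_precision_recall(retrieved_ids: List[str], relevant_ids: List[str]) -> Tuple[float, float]:
    """Set-based precision and recall of retrieved documents against labelled relevant ones."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


CONTEXTS = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
ANSWER = "The Eiffel Tower is in Paris. It was designed by aliens."

print("faithfulness:", naive_faithfulness(ANSWER, CONTEXTS))
print("precision, recall:", retrieval_precision_recall(["d1", "d2", "d3"], ["d1", "d4"]))
```

Here the second answer sentence is fluent but unsupported, so the faithfulness score drops even though retrieval returned relevant text. Scoring the two layers separately is what lets a tester tell a generation failure from a retrieval failure.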

Choosing and combining frameworks

These tools are not mutually exclusive. A common pattern uses a tracing-focused tool to observe an application, a declarative tool to run prompt regressions in CI, and a RAG-specific tool to score the retrieval layer. The tester's job is to map each framework's capabilities to the risks that matter: prompt drift, model upgrades, retrieval quality, or agent behaviour.

What frameworks do not solve

Frameworks provide the harness, not the judgement. They still depend on good test data, meaningful assertions, and appropriate metrics; a matrix run with weak assertions gives false confidence. Vendor feature sets change quickly, so testers should treat specific product claims as capabilities to verify, not as guarantees, and should anchor certification-level understanding on the category of work each tool performs rather than on a marketing feature list.

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

5.1 Evaluation frameworks (Promptfoo, LangSmith, OpenAI Evals, RAGAS)

