5.1 Evaluation frameworks (Promptfoo, LangSmith, OpenAI Evals, RAGAS)
Key Takeaways
- An evaluation framework defines test cases, runs prompts across models/variants, applies assertions or scoring, and tracks regressions in CI.
- Because outputs are non-deterministic, frameworks report aggregate pass rates or score distributions across many cases, not a single pass/fail.
- Promptfoo is declarative and config-driven, running side-by-side matrix evaluations of prompts and models for comparison and CI.
- LangSmith emphasises tracing and evaluation of multi-step LLM apps/agents; OpenAI Evals offers a reusable eval registry pattern.
- RAGAS scores RAG pipelines with faithfulness, answer relevancy, and context precision/recall, separating retrieval from generation failures.
Why evaluation frameworks exist
Testing a generative AI system is unlike testing deterministic software. The same prompt can yield different outputs across runs, models, and configuration changes, so there is rarely a single "expected result" to assert against. An evaluation framework is the tooling layer that makes this tractable. It lets a tester define test cases (prompts plus expectations), run those prompts across one or more models or prompt variants, apply assertions or scoring to the outputs, and track how results change over time. In effect these frameworks turn ad-hoc prompt experimentation into repeatable, version-controlled test suites that can run inside continuous integration (CI).
Core responsibilities
Regardless of vendor, an evaluation framework usually provides:
- Test-case definition — a structured way to declare inputs (prompts, variables, datasets) and the expected behaviour or scoring criteria.
- Matrix / multi-configuration execution — running the same cases across models, temperatures, or prompt templates to compare them side by side.
- Assertions and scoring — deterministic checks (contains, regex, JSON schema) plus model-graded or metric-based scores.
- Tracing and observability — capturing inputs, outputs, latency, token usage, and intermediate steps for debugging.
- Regression tracking and CI integration — storing results so a change that degrades quality is caught before release.
A key testing insight: because outputs are non-deterministic, evaluation typically reports aggregate pass rates or score distributions across many cases rather than a single pass or fail on one run.
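As a rough illustration of aggregate reporting and regression tracking, the sketch below runs one case many times and compares the pass rate to a stored baseline. It is not taken from any particular framework. `fake_model`, `Case`, `pass_rate`, `regressed`, and the 0.05 tolerance are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # deterministic assertion on a single output


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call: answers correctly about 80% of the time."""
    return "Paris" if rng.random() < 0.8 else "Lyon"


def pass_rate(cases: list[Case], trials: int, seed: int = 0) -> float:
    """Run every case `trials` times and return the fraction of passing runs."""
    rng = random.Random(seed)
    results = [
        case.check(fake_model(case.prompt, rng))
        for case in cases
        for _ in range(trials)
    ]
    return sum(results) / len(results)


def regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the pass rate falls more than `tolerance` below baseline."""
    return current < baseline - tolerance


cases = [Case("What is the capital of France?", lambda out: "Paris" in out)]
rate = pass_rate(cases, trials=200)
print(f"pass rate: {rate:.2f}")
print("regression" if regressed(rate, baseline=0.80) else "ok")
```

A CI job built this way would fail the build on `regressed(...)`, not on any single failing run, which is the practical consequence of non-determinism described above.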
Evaluators: deterministic, statistical, and model-graded
Inside a framework, the component that turns an output into a score is called an evaluator (or grader), and three kinds recur. Deterministic evaluators apply exact rules — string equality, contains, regex, or JSON-schema validation — and are cheap and reproducible but only work when the correct answer is well defined. Statistical evaluators compute a metric such as n-gram overlap or embedding similarity against a reference (covered in 5.2). Model-graded evaluators use another LLM to judge the output against a rubric, which is flexible for open-ended tasks but slower and non-deterministic. A mature suite mixes all three, choosing the cheapest evaluator that still detects the failure you care about.
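The three evaluator kinds can be sketched as plain functions. This is a minimal illustration and does not reproduce any framework's API: all function names are hypothetical, token-level F1 stands in for the n-gram overlap metrics mentioned above, and `judge` is an assumed callable that wraps whatever LLM would do the grading.

```python
import json
import re
from collections import Counter
from typing import Callable


# Deterministic evaluators: exact rules, cheap and reproducible.
def contains(output: str, needle: str) -> bool:
    return needle.lower() in output.lower()


def matches(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None


def is_json_with_keys(output: str, keys: set[str]) -> bool:
    """True when the output parses as a JSON object holding all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and keys.issubset(data)


# Statistical evaluator: token-overlap F1 against a reference answer.
def token_f1(output: str, reference: str) -> float:
    out, ref = output.lower().split(), reference.lower().split()
    overlap = sum((Counter(out) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(out), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Model-graded evaluator: `judge` is a hypothetical wrapper around an LLM call.
def model_graded(output: str, rubric: str, judge: Callable[[str], str]) -> bool:
    verdict = judge(f"Rubric: {rubric}\nOutput: {output}\nAnswer PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")
```

In a suite, the deterministic checks would normally run first, with the model-graded check reserved for cases the cheaper evaluators cannot decide.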
Good evaluation also depends on a curated dataset: a versioned set of representative inputs, ideally including known edge cases and past production incidents, with expected outputs or acceptance criteria. Because the dataset is the yardstick against which every model change is measured, it should be treated as a first-class, version-controlled test asset that grows as new failure modes are discovered.
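One way to make the dataset a traceable test asset is to stamp every evaluation run with a fingerprint of the dataset content. The sketch below assumes a hypothetical record layout (`id`, `input`, `expected`, `tags`) and a hypothetical `fingerprint` helper. Real projects may simply rely on the version-control commit hash instead.

```python
import hashlib
import json

# Hypothetical dataset records: a baseline case and an edge case.
dataset = [
    {"id": "capital-fr", "input": "Capital of France?",
     "expected": "Paris", "tags": ["baseline"]},
    {"id": "empty-input", "input": "",
     "expected": "ask for clarification", "tags": ["edge-case"]},
]


def fingerprint(cases: list[dict]) -> str:
    """Short, stable hash of the dataset content, independent of case order."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


version = fingerprint(dataset)
print(f"dataset version: {version} ({len(dataset)} cases)")
```

Storing this value next to each result makes it clear whether two runs were scored against the same yardstick or whether the dataset grew in between.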
The main frameworks and what they focus on
The syllabus expects familiarity with several representative tools. They overlap but emphasise different concerns.
| Framework | Primary focus | Typical use |
|---|---|---|
| Promptfoo | Declarative prompt testing and side-by-side matrix evaluation | Compare prompts/models via a config file, assert on outputs, run in CI |
| LangSmith | Tracing plus evaluation of LLM applications | Debug multi-step chains/agents, run dataset evals, monitor production |
| OpenAI Evals | A registry/framework for defining and sharing evals | Author reusable eval templates and run them against models |
| RAGAS | RAG-specific quality metrics | Score retrieval-augmented pipelines on faithfulness and relevancy |
Promptfoo is a declarative, configuration-driven tool. A tester writes prompts, providers (models), and test cases with assertions in a config file, then runs a matrix evaluation that shows the output of every prompt and model combination side by side. This suits prompt comparison and lightweight CI checks, where assertions verify that outputs contain expected text, match a pattern, or satisfy model-graded criteria.
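The matrix idea itself can be sketched in a few lines of Python. This is a conceptual illustration only and does not use Promptfoo's configuration format or API. The `templates`, `providers`, and `tests` structures and the stub model functions are hypothetical.

```python
from itertools import product
from typing import Callable

# Hypothetical prompt templates under comparison.
templates = {
    "terse": "Answer in one word: {question}",
    "polite": "Please answer briefly: {question}",
}

# Hypothetical providers: stub functions standing in for real model calls.
providers: dict[str, Callable[[str], str]] = {
    "model-a": lambda prompt: "Paris",
    "model-b": lambda prompt: (
        "The capital is Paris." if "Please" in prompt else "paris"
    ),
}

# Each test supplies template variables and a case-sensitive "contains" assertion.
tests = [{"vars": {"question": "What is the capital of France?"}, "contains": "Paris"}]


def run_matrix() -> dict[tuple[str, str], float]:
    """Return the pass rate for every (template, provider) cell of the matrix."""
    grid = {}
    for (t_name, template), (p_name, call) in product(
        templates.items(), providers.items()
    ):
        passed = [t["contains"] in call(template.format(**t["vars"])) for t in tests]
        grid[(t_name, p_name)] = sum(passed) / len(passed)
    return grid


grid = run_matrix()
for (t_name, p_name), rate in sorted(grid.items()):
    print(f"{t_name:<8} {p_name:<8} {rate:.0%}")
```

In this toy data the `terse` template fails only on `model-b`, because the stub returns lowercase `paris` and the assertion is case-sensitive. Isolating a failure to one cell like this is what a side-by-side matrix is for.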
LangSmith centres on tracing and evaluation of LLM applications, especially multi-step chains and agents. It records the full execution trace so a tester can inspect intermediate calls, then run evaluations over curated datasets and monitor behaviour in production. Its strength is observability of complex pipelines, not only single-prompt scoring.
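To show what recording a full execution trace means in practice, here is a minimal hand-rolled recorder for a two-step pipeline. It does not use the LangSmith SDK. `Span`, `Trace`, `retrieve`, and `generate` are hypothetical names, and the two pipeline steps are stubs.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Span:
    name: str
    inputs: Any
    output: Any
    latency_ms: float


@dataclass
class Trace:
    spans: list[Span] = field(default_factory=list)

    def step(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Call `fn(*args)` and record its inputs, output, and latency as a span."""
        start = time.perf_counter()
        output = fn(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.spans.append(Span(name, args, output, elapsed_ms))
        return output


def retrieve(question: str) -> list[str]:
    """Stub retriever; a real one would query a document store."""
    return ["Paris is the capital of France."]


def generate(question: str, context: list[str]) -> str:
    """Stub generator; a real one would call a model with the context."""
    return "Paris"


trace = Trace()
question = "What is the capital of France?"
context = trace.step("retrieve", retrieve, question)
answer = trace.step("generate", generate, question, context)

for span in trace.spans:
    print(f"{span.name}: {span.output!r} ({span.latency_ms:.2f} ms)")
```

Because each intermediate step is captured, a wrong final answer can be traced back to either the retrieved context or the generation step without rerunning the pipeline.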
OpenAI Evals provides a framework and registry pattern for defining evaluations as reusable, shareable specifications. Testers compose eval templates (for example, exact-match or model-graded evals) and run them against models, which encourages a library of standardised tests instead of one-off scripts.
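The registry pattern can be illustrated independently of any library. The sketch below is not the OpenAI Evals API. `REGISTRY`, `register`, `run_eval`, and the two eval names are hypothetical and show only how named, reusable evals replace one-off scripts.

```python
from typing import Callable

# An eval takes (output, expected) and returns pass/fail.
Eval = Callable[[str, str], bool]

REGISTRY: dict[str, Eval] = {}


def register(name: str) -> Callable[[Eval], Eval]:
    """Decorator that stores an eval under a shareable name."""
    def decorator(fn: Eval) -> Eval:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("exact-match")
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()


@register("includes")
def includes(output: str, expected: str) -> bool:
    return expected in output


def run_eval(name: str, samples: list[tuple[str, str]]) -> float:
    """Look up an eval by name and return its pass rate over (output, expected) pairs."""
    grader = REGISTRY[name]
    return sum(grader(out, exp) for out, exp in samples) / len(samples)


samples = [("Paris", "Paris"), ("It is Paris.", "Paris")]
for name in sorted(REGISTRY):
    print(f"{name}: {run_eval(name, samples):.0%}")
```

The same samples score differently under each registered eval, which is why the choice of eval template is itself a test-design decision.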
RAGAS is specialised for retrieval-augmented generation. Rather than general text assertions it computes RAG-oriented metrics such as faithfulness (is the answer grounded in the retrieved context?), answer relevancy (does the answer address the question?), and context precision / context recall (did retrieval return the right supporting passages?). This lets testers separate retrieval failures from generation failures.
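The arithmetic behind these metrics can be sketched with hand-labelled judgements. This is a simplified illustration and not the RAGAS library: the real tool derives the per-claim and per-chunk judgements with an LLM, whereas here they are assumed to be supplied as booleans. The rank-aware form of context precision is an assumption about how such a metric is commonly computed.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall(reference_claim_found: list[bool]) -> float:
    """Fraction of reference-answer claims that appear in the retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision(chunk_relevant: list[bool]) -> float:
    """Rank-aware precision: mean of precision@k over ranks k holding a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Hand-labelled example: retrieval found everything needed,
# but one of three answer claims is unsupported.
scores = {
    "faithfulness": faithfulness([True, True, False]),
    "context_recall": context_recall([True, True]),
    "context_precision": context_precision([True, False, True]),
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

High context recall combined with lower faithfulness, as in this example, points to a generation failure. The opposite pattern would point to retrieval.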
Choosing and combining frameworks
These tools are not mutually exclusive. A common pattern uses a tracing-focused tool to observe an application, a declarative tool to run prompt regressions in CI, and a RAG-specific tool to score the retrieval layer. The tester's job is to map each framework's capabilities to the risks that matter: prompt drift, model upgrades, retrieval quality, or agent behaviour.
What frameworks do not solve
Frameworks provide the harness, not the judgement. They still depend on good test data, meaningful assertions, and appropriate metrics; a matrix run with weak assertions gives false confidence. Vendor feature sets change quickly, so testers should treat specific product claims as capabilities to verify, not as guarantees, and anchor certification-level understanding on the category of work each tool performs, not on a marketing feature list.
A tester wants to run the same set of prompts against three different models and two prompt templates, then compare the outputs side by side in a config-driven file that can run in CI. Which tool best matches this need?
A retrieval-augmented generation (RAG) pipeline returns answers that are fluent but sometimes not supported by the retrieved documents. Which framework and metric most directly targets whether the answer is grounded in the retrieved context?
Why do evaluation frameworks typically report aggregate pass rates or score distributions rather than a single pass/fail from one execution?