6.2 Evaluation & Monitoring
Key Takeaways
- Offline evaluation on a fixed test set must change one variable at a time - hold the corpus, queries, generator, and judge constant to attribute a score change.
- Groundedness/faithfulness (answer supported by retrieved context) is distinct from correctness (matches a reference); correctness needs ground truth, groundedness does not.
- Recall@k measures whether relevant chunks reach the top-k; if they land in the top 10 but not the top 3, improve ranking or re-ranking rather than recall.
- AI Gateway inference tables log prompts, responses, status codes, latency, and tokens to a new Unity Catalog table; renaming it or altering its schema breaks logging.
- Rising thumbs-down with flat p95 latency and error rate signals behavioral quality drift in a segment, not an infrastructure failure.
Offline Evaluation Before You Ship
Evaluation divides into offline (before deployment, against a fixed dataset) and online (live production monitoring). Offline evaluation answers one question: does this build clear the quality bar? You run MLflow evaluation (mlflow.evaluate(), or for agents mlflow.genai.evaluate() under Mosaic AI Agent Evaluation) over a representative test set and compare candidates. The cardinal rule for a fair comparison, whether of two prompt variants, two chunk sizes such as 300 vs 800 tokens, or two models, is to change only that one variable while holding everything else (corpus, queries, generator, and judge) constant. Otherwise you cannot attribute the score movement to your change.
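The one-variable rule can be enforced mechanically before any scores are compared. The sketch below is illustrative only: the setting names (corpus, queries, generator, judge, chunk_size) and their values are hypothetical stand-ins for whatever your runs actually record, and it does not call any MLflow API.

```python
# Minimal sketch: refuse to compare two evaluation runs unless exactly one
# setting differs. All setting names and values here are hypothetical.
_MISSING = object()


def changed_variables(baseline: dict, candidate: dict) -> list:
    """Return the sorted names of settings whose values differ between runs."""
    keys = set(baseline) | set(candidate)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def assert_single_variable(baseline: dict, candidate: dict) -> str:
    """Return the single changed setting, or raise if the comparison is unfair."""
    diff = changed_variables(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, found: {diff}")
    return diff[0]


baseline = {
    "corpus": "docs_v3",
    "queries": "eval_v1",
    "generator": "model_a",
    "judge": "judge_v2",
    "chunk_size": 300,
}
candidate = dict(baseline, chunk_size=800)  # only the chunk size changes

print(assert_single_variable(baseline, candidate))
```

Running the check as a gate in front of the evaluation job means a score difference can only ever be reported alongside the one setting that produced it.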
Building the dataset well matters as much as the metrics. Sample real user questions and include common tasks plus difficult edge cases. Do not build the set from only last month's escalated tickets - that over-represents hard failures and misrepresents normal traffic. A strong pattern is to join inference-table records with human labels or ground truth to grow a realistic evaluation corpus straight from production.
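Joining logged requests with human labels might look like the following sketch. It works on plain Python dictionaries rather than a real inference table, and every field name (request_id, request, response, expected_response) is an assumption made for illustration, not the actual inference-table schema.

```python
# Minimal sketch: pair logged requests with human-provided reference answers
# to produce evaluation rows. Field names are hypothetical, not a real schema.
def build_eval_rows(inference_records: list, labels: list) -> list:
    """Inner-join logged records with labels on request_id."""
    labels_by_id = {label["request_id"]: label for label in labels}
    rows = []
    for record in inference_records:
        label = labels_by_id.get(record["request_id"])
        if label is None:
            continue  # unlabeled traffic cannot be scored for correctness
        rows.append({
            "request": record["request"],
            "response": record["response"],
            "expected_response": label["expected_response"],
        })
    return rows


logged = [
    {"request_id": "r1", "request": "How do I reset my password?", "response": "Use the reset link."},
    {"request_id": "r2", "request": "What is the refund window?", "response": "60 days."},
]
human_labels = [
    {"request_id": "r2", "expected_response": "30 days."},
]

eval_rows = build_eval_rows(logged, human_labels)
print(eval_rows)
```

Only labeled records survive the join, so the resulting rows can feed metrics that need ground truth, such as correctness.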
RAG and Agent Metrics
Databricks and MLflow ship built-in LLM-judge scorers. Know exactly what each measures and whether it needs ground truth:
| Metric | Question it answers | Needs ground truth? |
|---|---|---|
| Correctness | Does the answer match the expected/reference answer? | Yes |
| Groundedness / Faithfulness | Is the answer supported by the retrieved context? | No |
| Relevance to query | Does the answer address the user's question? | No |
| Retrieval relevance / chunk relevance | Do the retrieved chunks match the question? | No |
| Safety / Toxicity | Is the output harmful or toxic? | No |
| Recall@k / Precision@k | Are the relevant documents in the top-k results? | Labeled relevant docs |
Two distinctions are exam favorites. Groundedness/faithfulness is not correctness: an answer can be factually correct yet ungrounded because the model answered from prior knowledge while retrieval was weak - which is exactly why correctness can stay high while faithfulness drops after a retrieval-pipeline change. If an answer includes a refund exception that appears in no retrieved passage, faithfulness/groundedness is the metric that surfaces it. Retrieval relevance isolates the retriever (scored before generation), while Recall@k measures whether the relevant chunk reached the top-k. If the right chunk lands in the top 10 but rarely the top 3, you need better ranking near the top (re-ranking), not simply higher recall.
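The top-10 versus top-3 case can be made concrete with a small calculation. This is a generic sketch of Recall@k and Precision@k over chunk IDs, not the built-in Databricks or MLflow scorers; the chunk IDs are invented for the example.

```python
# Minimal sketch of Recall@k and Precision@k over ranked chunk IDs.
# The IDs below are made up for illustration.
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("recall is undefined without labeled relevant chunks")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(retrieved[:k]) & relevant) / k


# The one relevant chunk (c1) is ranked fifth: inside the top 10, outside the top 3.
retrieved = ["c7", "c2", "c9", "c4", "c1", "c8", "c3", "c5", "c6", "c0"]
relevant = {"c1"}

print(recall_at_k(retrieved, relevant, 3))   # 0.0
print(recall_at_k(retrieved, relevant, 10))  # 1.0
```

Recall@10 is already perfect here, so retrieving more candidates would not help; only moving c1 up the ranking changes what the generator sees when it receives the top 3.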
LLM-as-a-Judge Done Right
An LLM judge scales subjective scoring but must be controlled. Best practices: give the judge a clear rubric with criteria and example judgments; calibrate it against a human-labeled sample; and before trusting it for a release gate, measure its agreement with humans and refine the rubric if agreement is low. When the judge shares a model family with the generator, reduce bias with explicit rubrics, human calibration, and preferably an independent judge setup. For comparing two strong variants, pairwise judging (with variant names blinded and presentation order randomized) beats independent 1-to-5 scores because it reduces scale inconsistency and position/verbosity bias. Watch for a classic trap: if an offline score jumps overnight with no application or model change, suspect that the judge prompt or judge model itself changed, which breaks comparability with earlier runs.
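Blinding and order randomization for pairwise judging can be sketched as below. The judge here is a hypothetical callable that sees two unlabeled answers and returns "first" or "second"; the toy judge is a deterministic stand-in for a real LLM call, used only so the example runs.

```python
import random


# Minimal sketch of blinded, order-randomized pairwise judging.
# `judge` is a hypothetical callable: judge(question, first, second) -> "first" | "second".
def pairwise_judge(question, answer_a, answer_b, judge, rng):
    """Return "A" or "B" for the preferred variant; the judge never sees the names."""
    swap = rng.random() < 0.5  # randomize which variant is shown first
    first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Map the positional verdict back to the hidden variant name.
    if picked_first:
        return "B" if swap else "A"
    return "A" if swap else "B"


def toy_judge(question, first, second):
    """Stand-in for an LLM judge: prefers the answer that states the 30-day window."""
    return "first" if "30 days" in first else "second"


rng = random.Random(0)
wins = [
    pairwise_judge(
        "What is the refund window?",
        "Refunds are accepted within 30 days.",
        "Refunds are usually possible.",
        toy_judge,
        rng,
    )
    for _ in range(20)
]
print(wins.count("A") / len(wins))
```

Because the variant names are reattached only after the verdict, a judge with a positional preference has its bias spread across both variants instead of favoring one.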
Online Monitoring and Inference Tables
Once the app is live, offline tests cannot see real traffic, so enable online monitoring. Its foundation is AI Gateway-enabled inference tables, which automatically log prompts, responses, HTTP status codes, latency, token usage, and traces into a new Unity Catalog table you can query with SQL. Enabling them requires Can Manage on the endpoint plus USE CATALOG, USE SCHEMA, and CREATE TABLE on the target schema. Two operational gotchas: manually renaming the inference table or changing its schema corrupts logging, and if an endpoint already uses legacy inference tables, you must disable the legacy configuration before switching to AI Gateway tables.
On top of inference tables, Lakehouse Monitoring creates monitors that profile the logged data over time and surface drift and quality regressions with dashboards and alerts. A minimal production dashboard for a new RAG assistant tracks quality signals, safety/refusal rate, latency, error rate, and cost over time. Pair infrastructure metrics (p95 latency, error rate) with quality signals such as thumbs-up/thumbs-down feedback tied to each response. Distinguish failure types: if thumbs-down rises for billing questions while p95 latency and error rate stay flat, that is behavioral quality drift in a segment, not an infrastructure problem. To catch cost regressions, watch p95 latency and tokens per request. Prefer cost per successful task completion over cost per request, because it ties spend to delivered business value. After tightening content filters, monitor refusal rate by user intent to detect overblocking.
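Two of these signals, thumbs-down rate per segment and cost per successful task, reduce to simple aggregations. The sketch below runs over in-memory records with hypothetical field names (segment, feedback, cost_usd, task_succeeded); in practice the same aggregation would run over the logged table joined with feedback, whose real column names may differ.

```python
from collections import defaultdict


# Minimal sketch of two monitoring aggregations. Field names are hypothetical.
def thumbs_down_rate_by_segment(records: list) -> dict:
    """Share of responses rated thumbs-down, computed per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [thumbs_down, total]
    for r in records:
        counts[r["segment"]][1] += 1
        if r["feedback"] == "down":
            counts[r["segment"]][0] += 1
    return {seg: down / total for seg, (down, total) in counts.items()}


def cost_per_successful_task(records: list) -> float:
    """Total spend divided by the number of tasks that actually succeeded."""
    successes = sum(1 for r in records if r["task_succeeded"])
    if successes == 0:
        raise ValueError("no successful tasks in this window")
    return sum(r["cost_usd"] for r in records) / successes


records = [
    {"segment": "billing", "feedback": "down", "cost_usd": 0.5, "task_succeeded": False},
    {"segment": "billing", "feedback": "up", "cost_usd": 0.25, "task_succeeded": True},
    {"segment": "search", "feedback": "up", "cost_usd": 0.25, "task_succeeded": True},
    {"segment": "search", "feedback": "up", "cost_usd": 0.5, "task_succeeded": True},
]

print(thumbs_down_rate_by_segment(records))  # {'billing': 0.5, 'search': 0.0}
print(cost_per_successful_task(records))     # 0.5
```

Breaking the thumbs-down rate out by segment is what exposes a billing-only regression that an overall average would dilute, and dividing spend by successes rather than by requests makes a failed request count as cost with no value delivered.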
Feedback Loops and Human Review
Close the loop. The Review App lets subject-matter experts chat with a deployed app before launch and leave structured feedback on responses. In production, the most defensible use of thumbs-down data is to cluster low-rated traces, review root causes, and add representative failures to the evaluation dataset - converting live failures into regression tests. When offline judge scores improve but an online containment metric worsens after rollout, the eval set has a coverage gap: inspect the traces and expand it. This offline-to-online cycle - evaluate, deploy, monitor, harvest failures, re-evaluate - is the backbone of trustworthy GenAI on Databricks.
A RAG application returns an answer that is factually correct, but the retrieved passages do not actually support it. Which evaluation metric most directly flags this problem?
After enabling AI Gateway inference tables on a serving endpoint, which action is most likely to break or corrupt the logging?
After a model update, thumbs-down feedback rises specifically for billing questions while p95 latency and error rate stay flat. What does this most strongly indicate?