A RAG application returns an answer that is factually correct, but the retrieved passages do not actually support it. Which evaluation metric most directly flags this problem?

Faithfulness / groundedness. Faithfulness/groundedness measures whether the answer is supported by the retrieved context. Correctness can still look good because the model answered from prior knowledge, and recall@k and latency say nothing about whether the response is grounded in the evidence.

After enabling AI Gateway inference tables on a serving endpoint, which action is most likely to break or corrupt the logging?

Manually renaming the inference table or changing its schema. Databricks manages the inference table it creates; manually renaming it or altering its schema breaks the logging pipeline. Reading it with SQL, granting query access, or monitoring it are all normal, safe operations.

After a model update, thumbs-down feedback rises specifically for billing questions while p95 latency and error rate stay flat. What does this most strongly indicate?

Behavioral quality drift in a specific user segment. Stable latency and error rate rule out infrastructure faults, and a segment-specific rise in negative feedback points to a decline in answer quality for that intent - behavioral quality drift. Cost regressions would show up in tokens per request, not thumbs-down.

Evaluation & Monitoring — Free Study Guide 2026

Offline Evaluation Before You Ship

Evaluation divides into offline (before deployment, against a fixed dataset) and online (live production monitoring). Offline evaluation answers one question: does this build clear the quality bar? You run MLflow evaluation (mlflow.evaluate(), and for agents mlflow.genai.evaluate() under Mosaic AI Agent Evaluation) over a representative test set and compare candidates. The cardinal rule for a fair comparison - two prompt variants, two chunk sizes such as 300 vs 800 tokens, or two models - is to change only one variable while holding everything else (corpus, queries, generator, and judge) constant. Otherwise you cannot attribute the score movement to your change.
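The one-variable rule can be enforced mechanically before any scores are compared. The following is a minimal sketch, assuming each evaluation run is described by a flat configuration dictionary; the field names (corpus_version, query_set, generator, judge, chunk_size_tokens) are hypothetical and not part of MLflow or Databricks.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so an absent key differs from an explicit None


def changed_fields(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> List[str]:
    """Return the sorted names of config fields whose values differ."""
    keys = set(baseline) | set(candidate)
    return sorted(
        key for key in keys
        if baseline.get(key, _MISSING) != candidate.get(key, _MISSING)
    )


def is_fair_comparison(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> bool:
    """A score difference is attributable only when exactly one field changed."""
    return len(changed_fields(baseline, candidate)) == 1


# Hypothetical run configurations; the field names are illustrative only.
baseline = {
    "corpus_version": "v1",
    "query_set": "eval-v1",
    "generator": "model-a",
    "judge": "judge-a",
    "chunk_size_tokens": 300,
}
fair_candidate = {**baseline, "chunk_size_tokens": 800}
confounded_candidate = {**baseline, "chunk_size_tokens": 800, "judge": "judge-b"}

print(changed_fields(baseline, fair_candidate))
print(is_fair_comparison(baseline, confounded_candidate))
```

Running a check like this before comparing scores makes a confounded comparison (here, chunk size and judge changed together) visible before anyone interprets the result.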

Building the dataset well matters as much as the metrics. Sample real user questions and include common tasks plus difficult edge cases. Do not build the set from only last month's escalated tickets - that over-represents hard failures and misrepresents normal traffic. A strong pattern is to join inference-table records with human labels or ground truth to grow a realistic evaluation corpus straight from production.
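The join between logged requests and human labels can be sketched in plain Python. This is an illustration only: it assumes both sources have already been read into lists of dictionaries, and the keys request_id, question, response, and expected_answer are hypothetical names, not the actual inference-table schema.

```python
from typing import Dict, List


def build_eval_rows(inference_rows: List[dict], labels: List[dict]) -> List[Dict[str, str]]:
    """Inner-join logged requests with human labels on a shared request id."""
    labels_by_id = {label["request_id"]: label for label in labels}
    eval_rows = []
    for row in inference_rows:
        label = labels_by_id.get(row["request_id"])
        if label is None:
            continue  # unlabeled traffic cannot be scored for correctness
        eval_rows.append(
            {
                "request_id": row["request_id"],
                "question": row["question"],
                "response": row["response"],
                "expected_answer": label["expected_answer"],
            }
        )
    return eval_rows


# Hypothetical logged requests and reviewer labels.
inference_rows = [
    {"request_id": "r1", "question": "How do I reset my password?", "response": "Use the reset link."},
    {"request_id": "r2", "question": "When is my invoice due?", "response": "On the 1st."},
    {"request_id": "r3", "question": "Can I export my data?", "response": "Yes, as CSV."},
]
labels = [
    {"request_id": "r1", "expected_answer": "Use the reset link on the sign-in page."},
    {"request_id": "r3", "expected_answer": "Yes, exports are available as CSV."},
]

eval_rows = build_eval_rows(inference_rows, labels)
print([row["request_id"] for row in eval_rows])
```

Only labeled requests survive the join, so the resulting set stays grounded in real traffic while still carrying the reference answers that correctness scoring needs.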

RAG and Agent Metrics

Databricks and MLflow ship built-in LLM-judge scorers. Know exactly what each measures and whether it needs ground truth:

Metric	Question it answers	Needs ground truth?
Correctness	Does the answer match the expected/reference answer?	Yes
Groundedness / Faithfulness	Is the answer supported by the retrieved context?	No
Relevance to query	Does the answer address the user's question?	No
Retrieval relevance / chunk relevance	Do the retrieved chunks match the question?	No
Safety / Toxicity	Is the output harmful or toxic?	No
Recall@k / Precision@k	Are the relevant documents in the top-k results?	Yes (labeled relevant documents)

Two distinctions are exam favorites. Groundedness/faithfulness is not correctness: an answer can be factually correct yet ungrounded because the model answered from prior knowledge while retrieval was weak - which is exactly why correctness can stay high while faithfulness drops after a retrieval-pipeline change. If an answer includes a refund exception that appears in no retrieved passage, faithfulness/groundedness is the metric that surfaces it. Retrieval relevance isolates the retriever (scored before generation), while Recall@k measures whether the relevant chunk reached the top-k. If the right chunk lands in the top 10 but rarely the top 3, you need better ranking near the top (re-ranking), not simply higher recall.
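The top-10 versus top-3 case can be made concrete with a small calculation. This is a minimal sketch, assuming document IDs are unique within a ranking and using the common convention that Precision@k divides by k; the document IDs are made up for illustration.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant documents (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Hypothetical ranking: the single relevant chunk sits at rank 5.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d5", "d6", "d0"]
relevant = {"d1"}

print(recall_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 10))
print(precision_at_k(retrieved, relevant, 10))
```

Here Recall@10 is perfect while Recall@3 is zero: the retriever finds the chunk but ranks it too low, which is the situation where re-ranking helps more than retrieving additional candidates.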

LLM-as-a-Judge Done Right

An LLM judge scales subjective scoring but must be controlled. Best practices: give the judge a clear rubric with criteria and example judgments; calibrate against a human-labeled sample and refine the rubric when agreement is low; and before trusting a judge for a release gate, measure its agreement with humans and revise if disagreement is high. When the judge shares a model family with the generator, reduce bias with explicit rubrics, human calibration, and preferably an independent judge setup. For comparing two strong variants, pairwise judging - blind the variant names and randomize order - beats independent 1-to-5 scores because it reduces scale inconsistency and position/verbosity bias. Watch a classic trap: if an offline score jumps overnight with no application or model change, suspect that the judge prompt or judge model itself changed, breaking comparability.
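Blinded, order-randomized pairwise judging can be sketched as follows. The judge here is a hypothetical callable that sees only "first" and "second" answers and returns "first", "second", or "tie"; in practice it would wrap an LLM call, but a deterministic stand-in is used so the example runs on its own.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface: (question, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], str]


def pairwise_compare(question: str, answer_a: str, answer_b: str,
                     judge: Judge, rng: random.Random) -> str:
    """Show the two answers in random order without variant names, then map back."""
    swap = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    first_is_a = not swap
    if verdict == "first":
        return "A" if first_is_a else "B"
    return "B" if first_is_a else "A"


def win_counts(examples: List[Tuple[str, str, str]], judge: Judge, seed: int = 0) -> Dict[str, int]:
    """Tally wins for variant A, variant B, and ties over (question, a, b) triples."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0, "tie": 0}
    for question, answer_a, answer_b in examples:
        counts[pairwise_compare(question, answer_a, answer_b, judge, rng)] += 1
    return counts


def stand_in_judge(question: str, first: str, second: str) -> str:
    """Deterministic placeholder for an LLM judge: prefers answers marked [cited]."""
    first_cited, second_cited = "[cited]" in first, "[cited]" in second
    if first_cited == second_cited:
        return "tie"
    return "first" if first_cited else "second"


examples = [
    ("q1", "Refunds take 5 days [cited]", "Refunds are fast"),
    ("q2", "Yes", "Yes, per section 2 [cited]"),
    ("q3", "Contact support [cited]", "Contact us"),
]
print(win_counts(examples, stand_in_judge))
```

Because the judge never sees which variant is which and the presentation order is shuffled per example, a judge that systematically favors the first position cannot consistently favor one variant.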

Online Monitoring and Inference Tables

Once the app is live, offline tests cannot see real traffic, so enable online monitoring. Its foundation is AI Gateway-enabled inference tables, which automatically log prompts, responses, HTTP status codes, latency (execution time), token usage, and traces into a new Unity Catalog table you can query with SQL. Enabling them requires Can Manage on the endpoint plus USE CATALOG, USE SCHEMA, and CREATE TABLE on the target schema. Two operational gotchas: manually renaming the inference table or changing its schema corrupts logging, and if an endpoint already uses legacy inference tables you must disable the legacy configuration before switching to AI Gateway tables.

On top of inference tables, Lakehouse Monitoring creates monitors that profile the logged data over time and surface drift and quality regressions with dashboards and alerts. A minimal production dashboard for a new RAG assistant tracks quality signals, safety/refusal rate, latency, error rate, and cost over time. Pair infrastructure metrics (p95 latency, error rate) with quality signals such as thumbs-up/thumbs-down feedback tied to each response. Distinguish failure types: if thumbs-down rises for billing questions while p95 latency and error rate stay flat, that is behavioral quality drift in a segment, not an infrastructure problem. To catch cost regressions, watch p95 latency and tokens per request. Prefer cost per successful task completion over cost per request, because it ties spend to delivered business value. After tightening content filters, monitor refusal rate by user intent to detect overblocking.
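Two of the signals above, segment-level thumbs-down rate and cost per successful task, reduce to simple aggregations. The sketch below works on plain dictionaries with hypothetical keys (intent, feedback, cost_usd, task_completed) and an arbitrary 0.2 alert threshold; it is not the Lakehouse Monitoring API, only an illustration of the arithmetic.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def thumbs_down_rate_by_segment(records: List[dict]) -> Dict[str, float]:
    """Share of responses rated thumbs-down, per user intent."""
    totals: Dict[str, int] = defaultdict(int)
    downs: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["intent"]] += 1
        if record["feedback"] == "down":
            downs[record["intent"]] += 1
    return {intent: downs[intent] / totals[intent] for intent in totals}


def drifting_segments(before: List[dict], after: List[dict],
                      min_increase: float = 0.2) -> List[str]:
    """Intents whose thumbs-down rate rose by at least min_increase."""
    rate_before = thumbs_down_rate_by_segment(before)
    rate_after = thumbs_down_rate_by_segment(after)
    return sorted(
        intent for intent, rate in rate_after.items()
        if rate - rate_before.get(intent, 0.0) >= min_increase
    )


def cost_per_successful_task(records: List[dict]) -> Optional[float]:
    """Total spend divided by completed tasks; None when nothing succeeded."""
    total_cost = sum(record["cost_usd"] for record in records)
    successes = sum(1 for record in records if record["task_completed"])
    return total_cost / successes if successes else None


def sample(intent: str, total: int, down: int) -> List[dict]:
    """Build illustrative feedback records for one intent."""
    return [{"intent": intent, "feedback": "down" if i < down else "up"} for i in range(total)]


# Hypothetical feedback before and after a model update.
before = sample("billing", 4, 1) + sample("shipping", 4, 1)
after = sample("billing", 4, 3) + sample("shipping", 4, 1)
print(drifting_segments(before, after))

# Hypothetical cost records: four requests, two of which completed the task.
cost_records = [
    {"cost_usd": 0.02, "task_completed": True},
    {"cost_usd": 0.02, "task_completed": False},
    {"cost_usd": 0.02, "task_completed": True},
    {"cost_usd": 0.02, "task_completed": False},
]
print(cost_per_successful_task(cost_records))
```

In the cost example, spend per request looks cheap, but each completed task costs twice as much because half the requests failed to deliver a result.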

Feedback Loops and Human Review

Close the loop. The Review App lets subject-matter experts chat with a deployed app before launch and leave structured feedback on responses. In production, the most defensible use of thumbs-down data is to cluster low-rated traces, review root causes, and add representative failures to the evaluation dataset - converting live failures into regression tests. When offline judge scores improve but an online containment metric worsens after rollout, the eval set has a coverage gap: inspect the traces and expand it. This offline-to-online cycle - evaluate, deploy, monitor, harvest failures, re-evaluate - is the backbone of trustworthy GenAI on Databricks.
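Turning low-rated traces into regression tests can be sketched as a small harvesting step. This example assumes a reviewer has already tagged each thumbs-down trace with a root cause, and grouping by that tag stands in for clustering; the keys question, feedback, and root_cause are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def harvest_regression_cases(traces: List[dict], eval_questions: List[str],
                             per_cluster: int = 1) -> List[Dict[str, str]]:
    """Pick up to per_cluster new thumbs-down questions from each root-cause group."""
    clusters: Dict[str, List[dict]] = defaultdict(list)
    for trace in traces:
        if trace["feedback"] == "down":
            clusters[trace["root_cause"]].append(trace)

    # Normalize questions so existing eval cases are not added twice.
    seen = {question.strip().lower() for question in eval_questions}
    new_cases = []
    for cause in sorted(clusters):
        added = 0
        for trace in clusters[cause]:
            key = trace["question"].strip().lower()
            if key in seen:
                continue
            new_cases.append({"question": trace["question"], "root_cause": cause})
            seen.add(key)
            added += 1
            if added >= per_cluster:
                break
    return new_cases


# Hypothetical production traces with reviewer-assigned root causes.
traces = [
    {"question": "Why was I charged twice?", "feedback": "down", "root_cause": "missing_billing_docs"},
    {"question": "Can I get a prorated refund?", "feedback": "down", "root_cause": "missing_billing_docs"},
    {"question": "Where is my order?", "feedback": "up", "root_cause": ""},
    {"question": "How do I cancel?", "feedback": "down", "root_cause": "stale_policy_chunk"},
]
eval_questions = ["why was I charged twice?"]

new_cases = harvest_regression_cases(traces, eval_questions)
print(new_cases)
```

Sampling one representative per root cause keeps the evaluation set from being flooded by a single failure mode while still covering each distinct problem.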

Databricks Generative AI Engineer Associate Certification

Databricks Generative AI Engineer Associate

6.2 Evaluation & Monitoring

