4.3 Pipeline Monitoring, Alerting, and Error Handling

Key Takeaways

Job and task run history (durations, result_state, error messages) is visible in the Runs tab and queryable account-wide through the system.lakeflow system tables.
The job_run_timeline and job_task_run_timeline system tables record active periods; joining them to system.billing.usage attributes DBU cost to specific jobs.
Lakeflow Declarative Pipelines write a structured event log capturing data-quality (expectation) results, lineage, pipeline progress, and audit records.
Event hooks let a pipeline run a custom Python action (such as sending an alert) when specific events are logged.
Job-level notifications (email, Slack, Microsoft Teams, PagerDuty, or a custom webhook) fire on start, success, failure, or duration-exceeded; SQL Alerts can watch metric queries over the event log or system tables.

Last updated: June 2026

Run History and Result States

Every job run and task run is recorded with a start time, a duration, a life-cycle state (PENDING, RUNNING, TERMINATED, SKIPPED, INTERNAL_ERROR), and a final result state (SUCCEEDED, FAILED, CANCELED, TIMED_OUT, EXCLUDED). The Runs tab shows a matrix view of every run and a per-task DAG; click a failed task to see its error message, its driver and Spark logs, and a link to the Spark UI.

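A run is considered unsuccessful (and eligible for retries) when it finishes with a FAILED result_state or an INTERNAL_ERROR life-cycle state. This vocabulary matters because the system tables expose the same concepts, so the terms you learn in the UI carry over to the SQL you write to monitor at scale. Exact spellings can differ slightly between surfaces, so confirm the values a column actually contains before hard-coding them in a filter.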
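The rule above can be sketched as a small predicate. This is a minimal illustration, assuming the state names listed in this section; the function name and the sample run records are hypothetical and not part of any Databricks API.

```python
from typing import Optional

# State names are taken from the list in this section. Confirm the exact
# spellings your workspace returns before relying on them.
UNSUCCESSFUL_RESULT_STATES = {"FAILED"}
UNSUCCESSFUL_LIFE_CYCLE_STATES = {"INTERNAL_ERROR"}


def is_unsuccessful(life_cycle_state: str, result_state: Optional[str]) -> bool:
    """Return True when a finished run counts as unsuccessful."""
    if life_cycle_state in UNSUCCESSFUL_LIFE_CYCLE_STATES:
        return True
    return result_state in UNSUCCESSFUL_RESULT_STATES


# Hypothetical run records, shaped like the fields described above.
runs = [
    {"run_id": 1, "life_cycle_state": "TERMINATED", "result_state": "SUCCEEDED"},
    {"run_id": 2, "life_cycle_state": "TERMINATED", "result_state": "FAILED"},
    {"run_id": 3, "life_cycle_state": "INTERNAL_ERROR", "result_state": None},
    {"run_id": 4, "life_cycle_state": "TERMINATED", "result_state": "CANCELED"},
]

failed_run_ids = [
    r["run_id"]
    for r in runs
    if is_unsuccessful(r["life_cycle_state"], r["result_state"])
]
```

Note that a canceled run is not treated as unsuccessful here, which matches the rule as stated; adjust the sets if your team counts cancellations or timeouts as failures.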

System Tables for Fleet-Wide Monitoring

The Runs UI is per-job; to monitor all jobs across the account, query the system tables in the system.lakeflow schema. The key tables are:

System table	What it records
`system.lakeflow.jobs`	Job definitions/metadata (slowly changing)
`system.lakeflow.job_tasks`	Task definitions per job
`system.lakeflow.job_run_timeline`	Active period of each job run (period_start_time / period_end_time, result_state)
`system.lakeflow.job_task_run_timeline`	Active period of each task run

Because job_run_timeline carries the active period, you join it to system.billing.usage to attribute DBU spend to specific jobs and find your most expensive pipelines. (Note: from January 19, 2026, new timeline rows are sliced on clock-hour-aligned boundaries; existing rows are unchanged.) These tables typically lag real time, so they are for analysis and chargeback, not sub-minute alerting.
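One way to express that join is sketched below as a Python helper that builds the SQL text. The helper name and its parameters are hypothetical, and the column names used (usage_metadata.job_id, usage_metadata.job_run_id, usage_quantity, usage_unit, workspace_id, run_id) are assumptions to verify against the table schemas in your own workspace.

```python
def build_job_cost_query(lookback_days: int = 30, top_n: int = 20) -> str:
    """Build a SQL query that ranks jobs by DBUs consumed.

    Column names are assumptions; check them against your workspace.
    """
    if lookback_days <= 0 or top_n <= 0:
        raise ValueError("lookback_days and top_n must be positive")
    return f"""
WITH runs AS (
  -- A run can span several timeline rows, so reduce to one row per run
  -- to avoid double-counting usage in the join below.
  SELECT DISTINCT workspace_id, job_id, run_id
  FROM system.lakeflow.job_run_timeline
  WHERE period_start_time >= current_date() - INTERVAL {lookback_days} DAYS
)
SELECT
  u.workspace_id,
  u.usage_metadata.job_id AS job_id,
  COUNT(DISTINCT r.run_id) AS run_count,
  SUM(u.usage_quantity) AS dbus
FROM system.billing.usage AS u
JOIN runs AS r
  ON u.workspace_id = r.workspace_id
 AND u.usage_metadata.job_id = r.job_id
 AND u.usage_metadata.job_run_id = r.run_id
WHERE u.usage_unit = 'DBU'
GROUP BY u.workspace_id, u.usage_metadata.job_id
ORDER BY dbus DESC
LIMIT {top_n}
""".strip()


query = build_job_cost_query(lookback_days=7, top_n=5)
# In a notebook you would pass the text to the SQL engine, for example:
# spark.sql(query).show()
```

The query reports DBUs rather than currency; converting to cost would need a price lookup, which is left out of this sketch.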

The Lakeflow Declarative Pipelines Event Log

A Lakeflow Declarative Pipeline writes a structured event log containing everything about that pipeline: audit records, data-quality (expectation) results, pipeline progress, cluster events, and data lineage. You can publish the event log to a Unity Catalog table and query it with SQL. Useful filters by event_type:

flow_progress — row counts and the expectations metrics (records passed/failed each constraint).
flow_definition — schema and lineage of each dataset.
update_progress — overall pipeline-update state transitions.

This is how you answer "how many rows failed my expect_or_drop rule last night?": read the details:flow_progress.data_quality.expectations field of the flow_progress events. Event hooks let you register a Python callback that runs when matching events are logged, for example to call a webhook when an update fails.
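To make the shape of that field concrete, here is a minimal sketch that totals passed and failed records per expectation from flow_progress events. It assumes each expectation entry carries name, dataset, passed_records, and failed_records; the sample events, dataset name, and function name are hypothetical.

```python
import json
from collections import defaultdict


def summarize_expectations(events):
    """Total passed/failed records per (dataset, expectation) pair."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for event in events:
        if event.get("event_type") != "flow_progress":
            continue
        details = event.get("details") or {}
        if isinstance(details, str):
            # The details column is often read back as a JSON string.
            details = json.loads(details)
        data_quality = (details.get("flow_progress") or {}).get("data_quality") or {}
        for exp in data_quality.get("expectations") or []:
            key = (exp["dataset"], exp["name"])
            totals[key]["passed"] += exp.get("passed_records", 0)
            totals[key]["failed"] += exp.get("failed_records", 0)
    return dict(totals)


def _flow_progress(passed, failed):
    # Hypothetical event, modeled on the field path quoted in the text.
    return {
        "event_type": "flow_progress",
        "details": json.dumps({
            "flow_progress": {
                "data_quality": {
                    "expectations": [{
                        "name": "valid_order_id",
                        "dataset": "orders_clean",
                        "passed_records": passed,
                        "failed_records": failed,
                    }]
                }
            }
        }),
    }


sample_events = [
    _flow_progress(100, 5),
    _flow_progress(80, 2),
    {"event_type": "update_progress", "details": "{}"},
]
summary = summarize_expectations(sample_events)
```

In practice the same aggregation is usually done in SQL over the published event log table; the Python version is only meant to show which fields are read.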

Notifications, Alerts, and Error Handling

Job notifications are configured per job or per task and fire on start, success, failure, or duration-exceeded, delivered by email, Slack, Microsoft Teams, PagerDuty, or a custom webhook. Use duration-exceeded to catch a pipeline that is silently running slow rather than failing.
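As a rough sketch, the failure and duration-exceeded notifications might be expressed as a fragment of a job settings payload like the one below. The field names follow the Jobs API as best understood here and should be checked against the current API reference; the helper name, the email address, and the threshold are hypothetical.

```python
def build_notification_settings(on_call_email: str, slow_run_seconds: int) -> dict:
    """Return a job-settings fragment for failure and slow-run notifications."""
    if slow_run_seconds <= 0:
        raise ValueError("slow_run_seconds must be positive")
    return {
        "email_notifications": {
            # Notify when the run fails.
            "on_failure": [on_call_email],
            # Notify when the run exceeds the duration rule defined below.
            "on_duration_warning_threshold_exceeded": [on_call_email],
        },
        "health": {
            "rules": [
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": slow_run_seconds,
                }
            ]
        },
    }


settings = build_notification_settings("oncall@example.com", 3600)
```

The duration rule is what turns "silently running slow" into a signal: the run keeps going, but the warning fires once it passes the threshold.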

Databricks SQL Alerts run a scheduled SQL query against the event log or system tables and notify when a metric crosses a threshold (for example, expectation failure rate > 1%).
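The threshold check behind such an alert is simple arithmetic. The sketch below mirrors it in plain Python, assuming the metric query returns passed and failed record counts; the function names are hypothetical, and the 1% default matches the example in the text.

```python
def expectation_failure_rate(passed: int, failed: int) -> float:
    """Fraction of evaluated records that failed the expectation."""
    total = passed + failed
    # No records evaluated means there is nothing to alert on.
    return failed / total if total else 0.0


def should_alert(passed: int, failed: int, threshold: float = 0.01) -> bool:
    """True when the failure rate is strictly above the threshold."""
    return expectation_failure_rate(passed, failed) > threshold


noisy_night = should_alert(passed=980, failed=20)   # 2% failure rate
quiet_night = should_alert(passed=995, failed=5)    # 0.5% failure rate
```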

Error-handling patterns you should recognize:

Add a downstream task with Run if = At least one failed to send a custom alert or run cleanup.
Set retries with a sensible min_retry_interval for transient cloud errors.
Use timeouts so a hung task fails fast instead of burning compute.
Watch the event log and system tables for slow degradation, not just hard failures.
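The first three patterns above can be sketched as a pair of task definitions. The field names (max_retries, min_retry_interval_millis, retry_on_timeout, timeout_seconds, depends_on, run_if) follow the Jobs API as best understood here and should be verified against the current reference; the task keys, notebook paths, and helper name are hypothetical.

```python
def build_tasks(notebook_path: str, alert_notebook_path: str) -> list:
    """Return two task definitions: a main task and a failure-only follow-up."""
    main_task = {
        "task_key": "load_orders",
        "notebook_task": {"notebook_path": notebook_path},
        # Retry transient errors twice, waiting one minute between attempts.
        "max_retries": 2,
        "min_retry_interval_millis": 60_000,
        "retry_on_timeout": False,
        # Fail the task after 30 minutes instead of letting it hang.
        "timeout_seconds": 1800,
    }
    alert_task = {
        "task_key": "alert_on_failure",
        "depends_on": [{"task_key": "load_orders"}],
        # Run only when at least one upstream dependency failed.
        "run_if": "AT_LEAST_ONE_FAILED",
        "notebook_task": {"notebook_path": alert_notebook_path},
    }
    return [main_task, alert_task]


tasks = build_tasks("/Workspace/etl/load_orders", "/Workspace/etl/send_alert")
```

Keeping retry_on_timeout off in this sketch means a timed-out task goes straight to the alert branch rather than consuming more compute on another attempt; whether that is the right trade-off depends on the workload.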

Reading the Three Observability Layers Together

Think of observability as three complementary layers, each answering a different question:

Layer	Best for	Latency
Jobs Runs UI	Debugging one run — logs, Spark UI, the failed task	Real-time
Pipeline event log	Data-quality, lineage, expectation results for a Lakeflow pipeline	Near real-time
system.lakeflow tables	Account-wide trends, cost attribution, reliability KPIs	Delayed (analytical)

Use the UI when a pager fires and you need root cause now. Use the event log to answer data-quality questions for a declarative pipeline. Use the system tables for the weekly review — which jobs fail most, which cost the most DBUs, which are trending slower.

Operational Metrics Worth Tracking

Mature teams alert on more than pass/fail:

Failure rate per job over a rolling window (catches flaky pipelines).
Run duration trend — a creeping increase signals data growth or skew before it becomes a timeout.
Expectation failure rate from the event log — rising bad-data percentages.
DBU cost per run from the system tables joined to billing — catches a cluster that was accidentally oversized.

Wiring these into SQL Alerts turns raw run history into proactive signals instead of after-the-fact firefighting.
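Two of these metrics, rolling failure rate and duration trend, can be sketched in a few lines. This is an illustration over in-memory records rather than a query against the system tables; the record fields, the choice of which result states count as failures, and the window sizes are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Assumption: these result states count as failures for the KPI.
FAILURE_STATES = {"FAILED", "TIMED_OUT"}


def failure_rate_by_job(runs, window=20):
    """Share of failed runs among each job's most recent `window` runs."""
    by_job = defaultdict(list)
    for run in sorted(runs, key=lambda r: r["start"]):
        by_job[run["job_id"]].append(run["result_state"])
    rates = {}
    for job_id, states in by_job.items():
        recent = states[-window:]
        rates[job_id] = sum(s in FAILURE_STATES for s in recent) / len(recent)
    return rates


def duration_drift(durations, recent=3):
    """Ratio of the mean of the last `recent` durations to the earlier mean.

    Returns None when there is too little history to compare.
    """
    if len(durations) <= recent:
        return None
    baseline = mean(durations[:-recent])
    if baseline == 0:
        return None
    return mean(durations[-recent:]) / baseline


# Hypothetical run history: `start` is any sortable timestamp.
history = [
    {"job_id": "a", "start": 1, "result_state": "SUCCEEDED"},
    {"job_id": "a", "start": 2, "result_state": "FAILED"},
    {"job_id": "a", "start": 3, "result_state": "SUCCEEDED"},
    {"job_id": "a", "start": 4, "result_state": "TIMED_OUT"},
    {"job_id": "b", "start": 1, "result_state": "SUCCEEDED"},
    {"job_id": "b", "start": 2, "result_state": "SUCCEEDED"},
]
rates = failure_rate_by_job(history)
drift = duration_drift([100, 100, 100, 100, 150, 150, 150])
```

A drift ratio well above 1 suggests recent runs are slower than the baseline, which is the creeping increase worth alerting on before it becomes a timeout.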

Exam pointers

Know three things: the system.lakeflow system tables; that a Lakeflow Declarative Pipeline writes data-quality and lineage records to its event log; and that event hooks run custom Python (such as an alert) on logged events. Notifications fire on start, success, failure, or duration-exceeded over email, Slack, Microsoft Teams, PagerDuty, or a custom webhook. The system tables are delayed and meant for analysis and cost chargeback, not sub-minute paging. For immediate failure response, rely on job-level notifications and a downstream Run-if branch; for trend and cost analysis, query the system tables and the published event log on a schedule with Databricks SQL Alerts.

Monitoring quick recap

The Jobs/Runs UI shows run life-cycle and result states for ad-hoc triage.
System tables (system.lakeflow, joined to system.billing.usage) give queryable history and cost attribution.
Lakeflow Declarative Pipelines event logs expose data-quality (expectation) results and lineage; pair with SQL Alerts and job notifications for proactive failure detection.

Test Your Knowledge

You need to attribute DBU cost to individual jobs across the whole account to find your most expensive pipelines. What is the best approach?

Read each job's Spark UI manually

Enable continuous mode on every job

Join system.lakeflow.job_run_timeline to system.billing.usage

Export the Runs UI matrix to CSV daily

Test Your Knowledge

Where does a Lakeflow Declarative Pipeline record how many rows passed or failed each data-quality expectation?

In the cluster's local /tmp logs only

In the pipeline event log, in flow_progress events

Only in the billing system table

In the notebook revision history

Test Your Knowledge

A production job sometimes runs much longer than normal without failing. Which notification setting best surfaces this?

On start

On success

On duration exceeded (warning threshold)

On cancel

Test Your Knowledge

Which layer is the right place to answer 'how many rows failed my expectation last night?' for a Lakeflow Declarative Pipeline?

The cluster's Ganglia metrics

system.billing.usage

The Jobs cost dashboard

The pipeline event log's flow_progress events

