4.3 Pipeline Monitoring, Alerting, and Error Handling

Key Takeaways

  • Job and task run history (durations, result_state, error messages) is visible in the Runs tab and queryable account-wide through the system.lakeflow system tables.
  • The job_run_timeline and job_task_run_timeline system tables record active periods; joining them to system.billing.usage attributes DBU cost to specific jobs.
  • Lakeflow Declarative Pipelines write a structured event log capturing data-quality (expectation) results, lineage, pipeline progress, and audit records.
  • Event hooks let a pipeline run a custom Python action (such as sending an alert) when specific events are logged.
  • Job-level notifications (email, Slack, webhook) fire on start, success, failure, or duration-exceeded, and SQL Alerts can watch metric queries over the event log or system tables.

Last updated: June 2026

Run History and Result States

Every job run and task run is recorded with a start time, a duration, a life-cycle state (such as PENDING, RUNNING, TERMINATED, SKIPPED, or INTERNAL_ERROR), and a final result state (such as SUCCEEDED, FAILED, CANCELED, TIMED_OUT, or EXCLUDED). The Runs tab shows a matrix view of every run and the per-task DAG; click a failed task to read its error message and driver/Spark logs and to open the Spark UI.

A run is considered unsuccessful (and eligible for retries) when it finishes with a FAILED result_state or an INTERNAL_ERROR life-cycle state. This vocabulary matters because the same fields appear in the system tables, so the UI and the SQL you write to monitor at scale use identical terms.
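The rule above can be written as a one-line predicate. This is a minimal sketch: the function name is hypothetical, and the state strings are the ones listed in this section, so check them against the values your workspace actually returns.

```python
from typing import Optional


def is_unsuccessful(life_cycle_state: str, result_state: Optional[str]) -> bool:
    """Apply the rule from the text: a run is unsuccessful (retry-eligible)
    when it ends with result_state FAILED or life-cycle state INTERNAL_ERROR."""
    return result_state == "FAILED" or life_cycle_state == "INTERNAL_ERROR"


# Hypothetical sample runs as (life_cycle_state, result_state) pairs.
sample_runs = [
    ("TERMINATED", "SUCCEEDED"),
    ("TERMINATED", "FAILED"),
    ("INTERNAL_ERROR", None),
]
retry_candidates = [run for run in sample_runs if is_unsuccessful(*run)]
```

A run with no result state yet (for example, one that hit INTERNAL_ERROR before finishing) is handled by allowing `result_state` to be `None`.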

System Tables for Fleet-Wide Monitoring

The Runs UI is per-job; to monitor all jobs across the account you query the system tables in the system.lakeflow schema. The key tables:

System table                          | What it records
system.lakeflow.jobs                  | Job definitions/metadata (slowly changing)
system.lakeflow.job_tasks             | Task definitions per job
system.lakeflow.job_run_timeline      | Active period of each job run (period_start_time / period_end_time, result_state)
system.lakeflow.job_task_run_timeline | Active period of each task run

Because job_run_timeline carries the active period, you join it to system.billing.usage to attribute DBU spend to specific jobs and find your most expensive pipelines. (Note: from January 19, 2026, new timeline rows are sliced on clock-hour-aligned boundaries; existing rows are unchanged.) These tables typically lag real time, so they are for analysis and chargeback, not sub-minute alerting.
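As an illustration of the cost-attribution join, the sketch below builds a Databricks SQL query as a Python string. The function name and parameters are hypothetical. The billing column names (usage_metadata.job_id, usage_metadata.job_run_id, usage_quantity, usage_unit, usage_date) and the timeline's run_id column are assumptions based on the documented system-table schemas and should be verified in your account before use.

```python
def dbus_per_job_query(days: int = 30, top_n: int = 20) -> str:
    """Return SQL that ranks jobs by DBUs consumed over the last `days` days."""
    if days <= 0 or top_n <= 0:
        raise ValueError("days and top_n must be positive integers")
    return f"""
WITH run_dbus AS (      -- DBUs billed to each job run
  SELECT workspace_id,
         usage_metadata.job_id     AS job_id,
         usage_metadata.job_run_id AS run_id,
         SUM(usage_quantity)       AS dbus
  FROM system.billing.usage
  WHERE usage_metadata.job_id IS NOT NULL
    AND usage_unit = 'DBU'
    AND usage_date >= date_sub(current_date(), {int(days)})
  GROUP BY workspace_id, usage_metadata.job_id, usage_metadata.job_run_id
),
run_periods AS (        -- collapse timeline slices to one row per run
  SELECT workspace_id, job_id, run_id,
         MIN(period_start_time) AS started_at,
         MAX(period_end_time)   AS ended_at
  FROM system.lakeflow.job_run_timeline
  GROUP BY workspace_id, job_id, run_id
)
SELECT d.workspace_id,
       d.job_id,
       COUNT(DISTINCT d.run_id) AS runs,
       SUM(d.dbus)              AS total_dbus,
       SUM(timestampdiff(SECOND, p.started_at, p.ended_at)) / 3600.0 AS run_hours
FROM run_dbus d
LEFT JOIN run_periods p
  ON  d.workspace_id = p.workspace_id
  AND d.job_id = p.job_id
  AND d.run_id = p.run_id
GROUP BY d.workspace_id, d.job_id
ORDER BY total_dbus DESC
LIMIT {int(top_n)}
""".strip()


query = dbus_per_job_query(days=7, top_n=5)
```

Both sides are aggregated to one row per run before the join because a run can span several timeline rows; joining the raw tables directly would multiply the DBU totals.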

The Lakeflow Declarative Pipelines Event Log

A Lakeflow Declarative Pipeline writes a structured event log containing everything about that pipeline: audit records, data-quality (expectation) results, pipeline progress, cluster events, and data lineage. You can publish the event log to a Unity Catalog table and query it with SQL. Useful filters by event_type:

  • flow_progress — row counts and the expectations metrics (records passed/failed each constraint).
  • flow_definition — schema and lineage of each dataset.
  • update_progress — overall pipeline-update state transitions.

This is how you answer "how many rows failed my expect_or_drop rule last night?": read details:flow_progress.data_quality.expectations on the flow_progress events.

Event hooks let you register a Python callback that runs when matching events are logged, for example calling a webhook to alert on a failed update.
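To make the event structure concrete, here is a minimal sketch that works on event-log rows already loaded as Python dictionaries. The function names are hypothetical, and the per-expectation field names (name, dataset, passed_records, failed_records) and the update_progress state value FAILED are assumptions to confirm against your own event log.

```python
import json
from collections import defaultdict


def _details(event):
    """Return the event's details as a dict (it may arrive as a JSON string)."""
    details = event.get("details") or {}
    return json.loads(details) if isinstance(details, str) else details


def summarize_expectations(events):
    """Sum passed/failed record counts per (dataset, expectation name)."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for event in events:
        if event.get("event_type") != "flow_progress":
            continue
        flow = _details(event).get("flow_progress") or {}
        data_quality = flow.get("data_quality") or {}
        for exp in data_quality.get("expectations") or []:
            key = (exp.get("dataset"), exp.get("name"))
            totals[key]["passed"] += exp.get("passed_records") or 0
            totals[key]["failed"] += exp.get("failed_records") or 0
    return dict(totals)


def is_failed_update(event):
    """Predicate an alerting callback could use to detect a failed update."""
    if event.get("event_type") != "update_progress":
        return False
    progress = _details(event).get("update_progress") or {}
    return progress.get("state") == "FAILED"
```

In a pipeline, a predicate like `is_failed_update` would be called from the function registered as an event hook; the registration itself is omitted here because it depends on the pipeline runtime.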

Notifications, Alerts, and Error Handling

Job notifications are configured per job or per task and fire on start, success, failure, or duration-exceeded, delivered by email, Slack, Microsoft Teams, PagerDuty, or a custom webhook. Use duration-exceeded to catch a pipeline that is silently running slow rather than failing.
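A failure notification combined with a duration-exceeded threshold might look like the settings fragment below. This is a sketch, not a complete job definition: the helper name is hypothetical, and the field names (email_notifications, on_duration_warning_threshold_exceeded, health.rules with the RUN_DURATION_SECONDS metric) follow the Jobs API as I understand it and should be checked against the current API reference.

```python
def build_notification_settings(recipients, max_duration_seconds):
    """Build a job-settings fragment that emails on failure and on slow runs."""
    if max_duration_seconds <= 0:
        raise ValueError("max_duration_seconds must be positive")
    recipients = list(recipients)
    return {
        "email_notifications": {
            "on_failure": recipients,
            # Fires when a health rule below is breached.
            "on_duration_warning_threshold_exceeded": recipients,
        },
        "health": {
            "rules": [
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": max_duration_seconds,
                }
            ]
        },
    }


# Hypothetical address; warn when a run exceeds one hour.
settings = build_notification_settings(["oncall@example.com"], 3600)
```

The duration rule only warns; it does not stop the run, which is why it suits pipelines that are slow rather than broken.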

Databricks SQL Alerts run a scheduled SQL query against the event log or system tables and notify when a metric crosses a threshold (for example, expectation failure rate > 1%).
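The threshold logic behind such an alert is simple enough to sketch in a few lines. The function names and the 1% default are illustrative, taken from the example in the text; in practice the comparison would live in the alert's SQL query and condition.

```python
def expectation_failure_rate(passed: int, failed: int) -> float:
    """Fraction of evaluated records that failed; 0.0 when nothing was evaluated."""
    total = passed + failed
    return failed / total if total else 0.0


def should_alert(passed: int, failed: int, threshold: float = 0.01) -> bool:
    """True when the failure rate is strictly above the threshold."""
    return expectation_failure_rate(passed, failed) > threshold
```

Guarding the zero-record case matters: an empty run should not raise a division error or trigger a false alert.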

Error-handling patterns you should recognize:

  • Add a downstream task with Run if = At least one failed to send a custom alert or run cleanup.
  • Set retries with a sensible min_retry_interval for transient cloud errors.
  • Use timeouts so a hung task fails fast instead of burning compute.
  • Watch the event log and system tables for slow degradation, not just hard failures.
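The first three patterns can be combined in a task list like the one sketched below. Task keys and notebook paths are hypothetical, and the field names (max_retries, min_retry_interval_millis, retry_on_timeout, timeout_seconds, run_if, depends_on) are taken from the Jobs API as I understand it; verify them before relying on this.

```python
def build_tasks():
    """Return a two-task list: a retried, time-limited task plus a failure handler."""
    ingest = {
        "task_key": "ingest",  # hypothetical task name
        "notebook_task": {"notebook_path": "/Workspace/hypothetical/ingest"},
        "max_retries": 3,                      # retry transient cloud errors
        "min_retry_interval_millis": 60_000,   # wait one minute between attempts
        "retry_on_timeout": False,             # a hung run is not simply retried
        "timeout_seconds": 1800,               # fail fast after 30 minutes
    }
    notify = {
        "task_key": "notify_on_failure",  # hypothetical task name
        "notebook_task": {"notebook_path": "/Workspace/hypothetical/send_alert"},
        "depends_on": [{"task_key": "ingest"}],
        "run_if": "AT_LEAST_ONE_FAILED",  # "Run if = At least one failed" in the UI
    }
    return [ingest, notify]


tasks = build_tasks()
```

Keeping the handler as a separate task means the alert or cleanup logic runs only on the failure path and does not change the result of the upstream task.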

Reading the Three Observability Layers Together

Think of observability as three complementary layers, each answering a different question:

Layer                  | Best for                                                           | Latency
Jobs Runs UI           | Debugging one run: logs, Spark UI, the failed task                 | Real-time
Pipeline event log     | Data-quality, lineage, expectation results for a Lakeflow pipeline | Near real-time
system.lakeflow tables | Account-wide trends, cost attribution, reliability KPIs            | Delayed (analytical)

Use the UI when a pager fires and you need root cause now. Use the event log to answer data-quality questions for a declarative pipeline. Use the system tables for the weekly review — which jobs fail most, which cost the most DBUs, which are trending slower.

Operational Metrics Worth Tracking

Mature teams alert on more than pass/fail:

  • Failure rate per job over a rolling window (catches flaky pipelines).
  • Run duration trend — a creeping increase signals data growth or skew before it becomes a timeout.
  • Expectation failure rate from the event log — rising bad-data percentages.
  • DBU cost per run from the system tables joined to billing — catches a cluster that was accidentally oversized.

Wiring these into SQL Alerts turns raw run history into proactive signals instead of after-the-fact firefighting.
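Two of these metrics, rolling failure rate and run-duration trend, can be sketched over a job's run history in plain Python. The function names, window sizes, and the choice to count only FAILED and TIMED_OUT as failures are assumptions for illustration; in production the same logic would more likely be expressed in SQL over the system tables.

```python
from statistics import median

# Assumption: which result states count against a job's reliability.
UNSUCCESSFUL = {"FAILED", "TIMED_OUT"}


def rolling_failure_rate(result_states, window=20):
    """Share of unsuccessful runs among the most recent `window` runs."""
    recent = result_states[-window:]
    if not recent:
        return 0.0
    return sum(1 for state in recent if state in UNSUCCESSFUL) / len(recent)


def duration_creep(durations_seconds, baseline_runs=10, recent_runs=5):
    """Ratio of recent median duration to the preceding baseline median.

    Returns None when there is too little history or the baseline is zero.
    """
    needed = baseline_runs + recent_runs
    if len(durations_seconds) < needed:
        return None
    baseline = median(durations_seconds[-needed:-recent_runs])
    recent = median(durations_seconds[-recent_runs:])
    return recent / baseline if baseline else None
```

Medians are used rather than means so that a single unusually long run does not dominate the trend; a ratio well above 1.0 suggests the job is slowing before it reaches a timeout.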

Exam pointers

Know the system.lakeflow system tables; that a Lakeflow Declarative Pipeline writes data-quality and lineage to its event log; and that event hooks run custom Python (such as an alert) on logged events. Notifications fire on start, success, failure, or duration-exceeded over email, Slack, or webhook. The system tables are delayed and meant for analysis and cost chargeback, not sub-minute paging. For immediate failure response, rely on job-level notifications and a downstream Run-if branch; for trend and cost analysis, query the system tables and the published event log on a schedule with Databricks SQL Alerts.

Monitoring quick recap

  • The Jobs/Runs UI shows run life-cycle and result states for ad-hoc triage.
  • System tables (system.lakeflow, joined to system.billing.usage) give queryable history and cost attribution.
  • Lakeflow Declarative Pipelines event logs expose data-quality (expectation) results and lineage; pair with SQL Alerts and job notifications for proactive failure detection.

Test Your Knowledge

You need to attribute DBU cost to individual jobs across the whole account to find your most expensive pipelines. What is the best approach?

Where does a Lakeflow Declarative Pipeline record how many rows passed or failed each data-quality expectation?

A production job sometimes runs much longer than normal without failing. Which notification setting best surfaces this?

Which layer is the right place to answer 'how many rows failed my expectation last night?' for a Lakeflow Declarative Pipeline?
