4.3 Pipeline Monitoring, Alerting, and Error Handling
Key Takeaways
- Job and task run history (durations, result_state, error messages) is visible in the Runs tab and queryable account-wide through the system.lakeflow system tables.
- The job_run_timeline and job_task_run_timeline system tables record active periods; joining them to system.billing.usage attributes DBU cost to specific jobs.
- Lakeflow Declarative Pipelines write a structured event log capturing data-quality (expectation) results, lineage, pipeline progress, and audit records.
- Event hooks let a pipeline run a custom Python action (such as sending an alert) when specific events are logged.
- Job-level notifications (email, Slack, webhook) fire on start, success, failure, or duration-exceeded, and SQL Alerts can watch metric queries over the event log or system tables.
Run History and Result States
Every job run and task run is recorded with a start time, a duration, a life-cycle state (PENDING, RUNNING, TERMINATED, SKIPPED, INTERNAL_ERROR), and a final result state (SUCCEEDED, FAILED, CANCELED, TIMED_OUT, EXCLUDED). The Runs tab shows a matrix view of every run and the per-task DAG; click a failed task to read its error message and driver/Spark logs, and to open the Spark UI.
A run is considered unsuccessful (and eligible for retries) when it finishes with a FAILED result_state or an INTERNAL_ERROR life-cycle state. This vocabulary matters because the same fields appear in the system tables, so the UI and the SQL you write to monitor at scale use identical terms.
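The rule above can be written down as a small predicate. This is a minimal sketch that uses only the state names listed in this section; the function name and constants are illustrative, not part of any Databricks API.

```python
# Sketch only: the state names come from the text above; everything else
# (function and constant names) is hypothetical.
UNSUCCESSFUL_RESULT_STATES = {"FAILED"}
UNSUCCESSFUL_LIFE_CYCLE_STATES = {"INTERNAL_ERROR"}


def is_unsuccessful(life_cycle_state, result_state=None):
    """Return True when a run counts as unsuccessful under the rule above."""
    return (
        life_cycle_state in UNSUCCESSFUL_LIFE_CYCLE_STATES
        or result_state in UNSUCCESSFUL_RESULT_STATES
    )


# A terminated run that failed is unsuccessful; a succeeded one is not.
checks = [
    ("TERMINATED", "FAILED"),
    ("INTERNAL_ERROR", None),
    ("TERMINATED", "SUCCEEDED"),
]
for life_cycle, result in checks:
    print(life_cycle, result, is_unsuccessful(life_cycle, result))
```

Keeping the two sets separate mirrors the two fields: a run can be unsuccessful because of how it ended (result state) or because the platform itself failed (life-cycle state).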
System Tables for Fleet-Wide Monitoring
The Runs UI is per-job; to monitor all jobs across the account you query the system tables in the system.lakeflow schema. The key tables:
| System table | What it records |
|---|---|
| system.lakeflow.jobs | Job definitions/metadata (slowly changing) |
| system.lakeflow.job_tasks | Task definitions per job |
| system.lakeflow.job_run_timeline | Active period of each job run (period_start_time / period_end_time, result_state) |
| system.lakeflow.job_task_run_timeline | Active period of each task run |
Because job_run_timeline carries the active period, you join it to system.billing.usage to attribute DBU spend to specific jobs and find your most expensive pipelines. (Note: from January 19, 2026, new timeline rows are sliced on clock-hour-aligned boundaries; existing rows are unchanged.) These tables typically lag real time, so they are for analysis and chargeback, not sub-minute alerting.
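As a rough illustration of that join, the sketch below builds a SQL string that ranks jobs by DBU usage. The column names used here (usage_metadata.job_id, usage_quantity, usage_unit, usage_date, workspace_id, run_id) are assumptions that should be verified against the current system-table schemas, and the helper name is hypothetical. Usage and runs are aggregated separately before joining so that multiple timeline rows per run do not inflate the DBU total.

```python
def build_job_cost_query(lookback_days=30):
    """Build a SQL query ranking jobs by DBU usage (sketch, unverified schema)."""
    if lookback_days <= 0:
        raise ValueError("lookback_days must be positive")
    # Column names below are assumptions; check them against the schema
    # of system.billing.usage and system.lakeflow.job_run_timeline.
    return f"""
WITH job_usage AS (
  SELECT workspace_id,
         usage_metadata.job_id AS job_id,
         SUM(usage_quantity)   AS dbus
  FROM system.billing.usage
  WHERE usage_metadata.job_id IS NOT NULL
    AND usage_unit = 'DBU'
    AND usage_date >= date_sub(current_date(), {lookback_days})
  GROUP BY workspace_id, usage_metadata.job_id
),
job_runs AS (
  SELECT workspace_id,
         job_id,
         COUNT(DISTINCT run_id) AS runs
  FROM system.lakeflow.job_run_timeline
  WHERE period_start_time >= date_sub(current_date(), {lookback_days})
  GROUP BY workspace_id, job_id
)
SELECT u.workspace_id, u.job_id, r.runs, u.dbus
FROM job_usage AS u
LEFT JOIN job_runs AS r
  ON u.workspace_id = r.workspace_id AND u.job_id = r.job_id
ORDER BY u.dbus DESC
""".strip()


print(build_job_cost_query(7))
```

The resulting string could be run from a notebook or scheduled as a Databricks SQL query; multiplying the DBU total by a list price to get a currency figure is left out here.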
The Lakeflow Declarative Pipelines Event Log
A Lakeflow Declarative Pipeline writes a structured event log containing everything about that pipeline: audit records, data-quality (expectation) results, pipeline progress, cluster events, and data lineage. You can publish the event log to a Unity Catalog table and query it with SQL. Useful filters by event_type:
- flow_progress — row counts and the expectations metrics (records passed/failed each constraint).
- flow_definition — schema and lineage of each dataset.
- update_progress — overall pipeline-update state transitions.
This is how you answer "how many rows failed my expect_or_drop rule last night?" — you read the flow_progress events' details:flow_progress.data_quality.expectations. Event hooks let you register a Python callback that runs when matching events are logged — for example, calling a webhook to alert on a failed update.
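To make the shape of that lookup concrete, here is a minimal sketch that totals passed and failed records per expectation from event-log rows that have already been fetched as Python dictionaries. The field names inside each expectation entry (name, dataset, passed_records, failed_records) and the sample rows are assumptions for illustration; confirm them against your own event log before relying on them.

```python
import json
from collections import defaultdict


def expectation_totals(event_rows):
    """Sum passed/failed record counts per (dataset, expectation name)."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for row in event_rows:
        if row.get("event_type") != "flow_progress":
            continue
        details = row.get("details") or {}
        if isinstance(details, str):  # details is often stored as a JSON string
            details = json.loads(details)
        flow_progress = details.get("flow_progress") or {}
        data_quality = flow_progress.get("data_quality") or {}
        for exp in data_quality.get("expectations") or []:
            key = (exp.get("dataset"), exp.get("name"))
            totals[key]["passed"] += exp.get("passed_records") or 0
            totals[key]["failed"] += exp.get("failed_records") or 0
    return dict(totals)


# Hypothetical sample rows, shaped like flow_progress events.
sample_rows = [
    {"event_type": "flow_progress", "details": json.dumps({"flow_progress": {
        "data_quality": {"expectations": [
            {"name": "valid_id", "dataset": "orders",
             "passed_records": 90, "failed_records": 10}]}}})},
    {"event_type": "flow_progress", "details": {"flow_progress": {
        "data_quality": {"expectations": [
            {"name": "valid_id", "dataset": "orders",
             "passed_records": 95, "failed_records": 5}]}}}},
    {"event_type": "update_progress", "details": {}},
]
print(expectation_totals(sample_rows))
```

The same aggregation can be done directly in SQL over the published event log table; the Python version is shown only because it is easy to test in isolation.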
Notifications, Alerts, and Error Handling
Job notifications are configured per job or per task and fire on start, success, failure, or duration-exceeded, delivered by email, Slack, Microsoft Teams, PagerDuty, or a custom webhook. Use duration-exceeded to catch a pipeline that is silently running slow rather than failing.
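A duration-exceeded notification is typically expressed as a health rule plus a notification list. The sketch below assembles such a fragment as a Python dictionary in the shape of a Jobs API job-settings payload; the helper name, the email address, and the threshold are hypothetical, and the field names should be checked against the Jobs API reference for your workspace.

```python
def build_notification_settings(on_call_email, max_duration_seconds,
                                webhook_destination_id=None):
    """Assemble a job-settings fragment for failure and slow-run alerts (sketch)."""
    if max_duration_seconds <= 0:
        raise ValueError("max_duration_seconds must be positive")
    settings = {
        "email_notifications": {
            "on_failure": [on_call_email],
            # Fires when the health rule below is breached.
            "on_duration_warning_threshold_exceeded": [on_call_email],
        },
        "health": {
            "rules": [{
                "metric": "RUN_DURATION_SECONDS",
                "op": "GREATER_THAN",
                "value": max_duration_seconds,
            }]
        },
    }
    if webhook_destination_id:
        # Assumed to be the ID of a notification destination configured
        # separately (for example Slack, Teams, or PagerDuty).
        settings["webhook_notifications"] = {
            "on_failure": [{"id": webhook_destination_id}]
        }
    return settings


# Hypothetical values: alert if a run exceeds one hour.
print(build_notification_settings("oncall@example.com", 3600))
```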
Databricks SQL Alerts run a scheduled SQL query against the event log or system tables and notify when a metric crosses a threshold (for example, expectation failure rate > 1%).
Error-handling patterns you should recognize:
- Add a downstream task with Run if = At least one failed to send a custom alert or run cleanup.
- Set retries with a sensible min_retry_interval for transient cloud errors.
- Use timeouts so a hung task fails fast instead of burning compute.
- Watch the event log and system tables for slow degradation, not just hard failures.
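The first three patterns can be combined in a single task list. The following is a sketch of a Jobs API style task definition expressed as Python dictionaries; the task keys, notebook paths, and numeric values are hypothetical, and the field names (max_retries, min_retry_interval_millis, retry_on_timeout, timeout_seconds, run_if) should be confirmed against the Jobs API reference.

```python
def build_tasks(retries=3, retry_wait_seconds=60, timeout_seconds=1800):
    """Return a two-task list: a retried main task and an on-failure handler."""
    main_task = {
        "task_key": "ingest",  # hypothetical task name
        "notebook_task": {"notebook_path": "/Pipelines/ingest"},  # hypothetical path
        "max_retries": retries,
        "min_retry_interval_millis": retry_wait_seconds * 1000,
        "retry_on_timeout": False,
        "timeout_seconds": timeout_seconds,  # fail fast instead of hanging
    }
    failure_handler = {
        "task_key": "notify_on_failure",  # hypothetical task name
        "depends_on": [{"task_key": "ingest"}],
        # Runs only when at least one upstream dependency failed.
        "run_if": "AT_LEAST_ONE_FAILED",
        "notebook_task": {"notebook_path": "/Pipelines/send_alert"},  # hypothetical path
    }
    return [main_task, failure_handler]


for task in build_tasks():
    print(task["task_key"], task.get("run_if", "default run-if condition"))
```

Leaving retry_on_timeout off in this sketch is a deliberate choice: a task that hit its timeout is more likely to be hung or overloaded than hit by a transient error, so retrying it automatically may just repeat the wasted compute.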
Reading the Three Observability Layers Together
Think of observability as three complementary layers, each answering a different question:
| Layer | Best for | Latency |
|---|---|---|
| Jobs Runs UI | Debugging one run — logs, Spark UI, the failed task | Real-time |
| Pipeline event log | Data-quality, lineage, expectation results for a Lakeflow pipeline | Near real-time |
| system.lakeflow tables | Account-wide trends, cost attribution, reliability KPIs | Delayed (analytical) |
Use the UI when a pager fires and you need root cause now. Use the event log to answer data-quality questions for a declarative pipeline. Use the system tables for the weekly review — which jobs fail most, which cost the most DBUs, which are trending slower.
Operational Metrics Worth Tracking
Mature teams alert on more than pass/fail:
- Failure rate per job over a rolling window (catches flaky pipelines).
- Run duration trend — a creeping increase signals data growth or skew before it becomes a timeout.
- Expectation failure rate from the event log — rising bad-data percentages.
- DBU cost per run from the system tables joined to billing — catches a cluster that was accidentally oversized.
Wiring these into SQL Alerts turns raw run history into proactive signals instead of after-the-fact firefighting.
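As an illustration of the first two metrics, the sketch below computes a per-job failure rate and a simple duration-drift ratio from run records held in memory. The record layout and the choice to count every non-SUCCEEDED run as a failure are assumptions made for this example; in practice the same logic would usually live in a SQL query over the system tables.

```python
from collections import defaultdict
from statistics import mean


def failure_rate_by_job(runs):
    """Fraction of runs per job whose result_state is not SUCCEEDED."""
    counts = defaultdict(lambda: [0, 0])  # job_id -> [failed, total]
    for run in runs:
        entry = counts[run["job_id"]]
        entry[1] += 1
        if run["result_state"] != "SUCCEEDED":  # assumption: all others count as failures
            entry[0] += 1
    return {job: failed / total for job, (failed, total) in counts.items()}


def duration_drift(durations, recent=3):
    """Mean of the last `recent` durations divided by the mean of earlier ones.

    Returns None when there is not enough history or the baseline is zero.
    """
    if len(durations) <= recent:
        return None
    baseline = mean(durations[:-recent])
    if baseline == 0:
        return None
    return mean(durations[-recent:]) / baseline


# Hypothetical run history.
runs = [
    {"job_id": "a", "result_state": "SUCCEEDED"},
    {"job_id": "a", "result_state": "FAILED"},
    {"job_id": "b", "result_state": "SUCCEEDED"},
]
print(failure_rate_by_job(runs))
print(duration_drift([100, 100, 100, 150, 150, 150]))
```

A drift ratio well above 1 over several consecutive windows is the kind of signal worth alerting on before a run actually hits its timeout.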
Exam pointers
Know the system.lakeflow system tables; that a Lakeflow Declarative Pipeline writes data-quality and lineage to its event log; and that event hooks run custom Python (such as an alert) on logged events. Notifications fire on start, success, failure, or duration-exceeded over email, Slack, or webhook. The system tables are delayed and meant for analysis and cost chargeback, not sub-minute paging. For immediate failure response, rely on job-level notifications and a downstream Run-if branch; for trend and cost analysis, query the system tables and the published event log on a schedule with Databricks SQL Alerts.
Monitoring quick recap
- The Jobs/Runs UI shows run life-cycle and result states for ad-hoc triage.
- System tables (system.lakeflow, joined to system.billing.usage) give queryable history and cost attribution.
- Lakeflow Declarative Pipelines event logs expose data-quality (expectation) results and lineage; pair with SQL Alerts and job notifications for proactive failure detection.
You need to attribute DBU cost to individual jobs across the whole account to find your most expensive pipelines. What is the best approach?
Where does a Lakeflow Declarative Pipeline record how many rows passed or failed each data-quality expectation?
A production job sometimes runs much longer than normal without failing. Which notification setting best surfaces this?
Which layer is the right place to answer 'how many rows failed my expectation last night?' for a Lakeflow Declarative Pipeline?