4.3 Pipeline Monitoring, Alerting, and Error Handling
Key Takeaways
- Lakeflow Declarative Pipeline UI shows run history, dataset lineage, data quality metrics, and processing durations.
- Databricks SQL Alerts trigger notifications when a SQL query result meets a specified condition (e.g., row count drops to zero).
- Job email and webhook notifications can alert teams on job start, success, failure, or duration threshold violations.
- Structured Streaming metrics (numInputRows, inputRowsPerSecond, processedRowsPerSecond) monitor streaming pipeline health.
- The event log for Lakeflow Declarative Pipelines stores detailed metrics, expectations results, and lineage information.
Pipeline Monitoring, Alerting, and Error Handling
Quick Answer: Monitor pipelines using the Lakeflow UI (run history, data quality metrics), Databricks SQL Alerts (query-based notifications), job notifications (email/webhook on success/failure), and streaming metrics (processing rates, latency). The event log provides detailed pipeline telemetry.
Lakeflow Declarative Pipeline Monitoring
Pipeline UI Dashboard
The pipeline UI provides real-time and historical monitoring:
| Metric | Description |
|---|---|
| Run history | Status (success/failure), duration, and timestamps for each run |
| Dataset lineage | Visual DAG showing data flow between tables |
| Expectations metrics | Pass/fail counts and percentages for data quality rules |
| Processing details | Records processed, files read, write durations |
| Cluster utilization | Compute resource usage and autoscaling events |
Event Log
The event log stores detailed pipeline telemetry:
```sql
-- Query the event log for a specific pipeline.
-- TABLE() takes a streaming table the pipeline updates; a pipeline ID
-- string can be passed instead.
SELECT
  timestamp,
  event_type,
  details
FROM event_log(TABLE(my_catalog.my_schema.my_pipeline))
WHERE event_type IN ('flow_progress', 'flow_definition', 'dataset_definition')
ORDER BY timestamp DESC;
```
Expectation Monitoring
```sql
-- Monitor data quality metrics from the event log.
-- Expectation results are stored as a JSON array inside
-- flow_progress events, so they must be parsed and exploded.
SELECT
  exp.dataset AS dataset,
  exp.name AS expectation_name,
  SUM(exp.passed_records) AS passed,
  SUM(exp.failed_records) AS failed
FROM (
  SELECT explode(
    from_json(
      details:flow_progress:data_quality:expectations,
      'array<struct<name: string, dataset: string, passed_records: bigint, failed_records: bigint>>'
    )
  ) AS exp
  FROM event_log(TABLE(my_catalog.my_schema.my_pipeline))
  WHERE event_type = 'flow_progress'
    AND details:flow_progress:data_quality:expectations IS NOT NULL
)
GROUP BY exp.dataset, exp.name;
```
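Outside of SQL, the same expectation payload can be summarized in Python. This is a minimal sketch assuming a simplified `flow_progress` event shape; the sample values and the `expectation_summary` helper are illustrative, not part of the Databricks API.

```python
import json

def expectation_summary(event_details_json):
    """Summarize expectation results from a flow_progress event's
    `details` payload (simplified shape mirroring the event log)."""
    details = json.loads(event_details_json)
    rows = []
    for exp in details["flow_progress"]["data_quality"]["expectations"]:
        total = exp["passed_records"] + exp["failed_records"]
        pass_rate = exp["passed_records"] / total if total else None
        rows.append((exp["dataset"], exp["name"], pass_rate))
    return rows

# Hypothetical flow_progress payload with one expectation result
sample = json.dumps({
    "flow_progress": {
        "data_quality": {
            "expectations": [
                {"dataset": "clean_events", "name": "valid_id",
                 "passed_records": 98, "failed_records": 2}
            ]
        }
    }
})
print(expectation_summary(sample))  # [('clean_events', 'valid_id', 0.98)]
```

A pass rate like this can feed a SQL Alert or a dashboard tile directly.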
Databricks SQL Alerts
SQL Alerts monitor query results and send notifications:
Creating an Alert
- Write a SQL query that returns a value to monitor
- Set a trigger condition (e.g., value > threshold, row count = 0)
- Configure the check schedule (how often to run the query)
- Set notification destinations (email, Slack, PagerDuty, webhook)
```sql
-- Alert query: detect if no new data arrived in the last hour
SELECT COUNT(*) AS new_records
FROM my_catalog.my_schema.raw_events
WHERE _ingest_timestamp >= current_timestamp() - INTERVAL 1 HOUR;
-- Alert triggers when new_records = 0
```
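The trigger condition the alert evaluates can be expressed as a small predicate. This sketch mirrors the `new_records = 0` freshness check above; the `freshness_alert` function and its timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone

def freshness_alert(latest_ingest_ts, max_age=timedelta(hours=1), now=None):
    """Return True when the table looks stale: no record has arrived
    within max_age. Mirrors the SQL alert condition new_records = 0."""
    now = now or datetime.now(timezone.utc)
    return latest_ingest_ts is None or (now - latest_ingest_ts) > max_age

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=30)
stale = now - timedelta(hours=3)
print(freshness_alert(fresh, now=now))  # False: data arrived recently
print(freshness_alert(stale, now=now))  # True: would trigger the alert
```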
Common Alert Scenarios
| Scenario | Alert Condition |
|---|---|
| Data freshness | No new records in the last N hours |
| Data quality | Null rate exceeds threshold |
| Volume anomaly | Record count deviates from expected range |
| Pipeline delay | Processing lag exceeds SLA |
| Cost monitoring | Daily compute cost exceeds budget |
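The volume-anomaly scenario in the table reduces to a range check on the daily record count. A minimal sketch, with hypothetical bounds:

```python
def volume_anomaly(count, expected_low, expected_high):
    """Flag a record count outside the expected daily range;
    an alert would fire when this returns True."""
    return not (expected_low <= count <= expected_high)

# Hypothetical daily ingest bounds of 80k-150k records
print(volume_anomaly(120_000, 80_000, 150_000))  # False: within range
print(volume_anomaly(5_000, 80_000, 150_000))    # True: anomalously low
```

In practice the bounds often come from a rolling average over recent days rather than fixed constants.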
Job Notifications
Notification Triggers
| Trigger | When It Fires |
|---|---|
| On start | When a job run begins |
| On success | When all tasks complete successfully |
| On failure | When any task fails (after retries) |
| On duration warning | When run time exceeds a threshold |
Notification Destinations
- Email — Send to one or more email addresses
- Slack — Post to a Slack channel via webhook
- PagerDuty — Create incidents for critical failures
- Webhook — Send to any HTTP endpoint
- Microsoft Teams — Post to a Teams channel
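As an illustration, the triggers and destinations above map onto a job's settings in the Jobs API JSON. The email addresses, webhook destination ID, and duration threshold below are placeholders:

```json
{
  "email_notifications": {
    "on_start": [],
    "on_success": ["data-team@example.com"],
    "on_failure": ["oncall@example.com"]
  },
  "webhook_notifications": {
    "on_failure": [{"id": "<notification-destination-id>"}]
  },
  "health": {
    "rules": [
      {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}
    ]
  }
}
```

The `health` rule implements the duration-warning trigger: it fires when a run exceeds the configured number of seconds.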
Error Handling Best Practices
Retry Configuration
Task retry settings:
- Max retries: 2 (retry twice before failing)
- Min retry interval: 60 seconds (wait 1 minute between retries)
- Retry on timeout: true (retry if the task times out)
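The retry policy above can be sketched as a loop: attempt the task, and on failure wait the minimum interval and try again until the retry budget is exhausted. The `run_with_retries` helper is illustrative, not a Databricks API.

```python
import time

def run_with_retries(task, max_retries=2, min_retry_interval_s=60,
                     sleep=time.sleep):
    """Run task(), retrying up to max_retries times with a fixed
    wait between attempts; re-raise once the budget is exhausted."""
    attempts = 0
    while True:
        try:
            return task()
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise
            sleep(min_retry_interval_s)

# Example: a task that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(run_with_retries(flaky, sleep=lambda s: None))  # ok
```

In Workflows this logic is configured declaratively per task; the sketch only makes the semantics concrete.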
Handling Failures in Notebooks
```python
# Exit with a status to control downstream task behavior
try:
    # Processing logic
    result = process_data()
    dbutils.notebook.exit("SUCCESS")
except Exception as e:
    # Record the error for downstream tasks, then re-raise so the
    # task is actually marked as failed -- calling
    # dbutils.notebook.exit("FAILURE") alone would report success.
    dbutils.jobs.taskValues.set(key="error_message", value=str(e))
    raise
```
Dead-Letter Pattern
```python
# Route bad records to a dead-letter table instead of failing the run
try:
    process_record(record)
except Exception as e:
    # Write to a dead-letter table for manual investigation,
    # preserving both the record and the error message
    write_to_dead_letter(record, error_details=str(e))
```
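Applied over a batch, the pattern splits the input into successfully processed records and a dead-letter collection. A self-contained sketch, with a plain list standing in for the dead-letter table:

```python
def route_records(records, process):
    """Process records one by one; failures are captured in a
    dead-letter list instead of failing the whole run."""
    processed, dead_letter = [], []
    for record in records:
        try:
            processed.append(process(record))
        except Exception as e:
            dead_letter.append({"record": record, "error": str(e)})
    return processed, dead_letter

# "bad" + 1 raises TypeError, so that record is dead-lettered
ok, dlq = route_records([1, "bad", 3], lambda r: r + 1)
print(ok)   # [2, 4]
print(dlq)  # one entry capturing the "bad" record and its error
```

A periodic SQL Alert on the dead-letter table's row count then surfaces systematic quality problems.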
Streaming Monitoring
Key Streaming Metrics
| Metric | Description | Healthy Range |
|---|---|---|
| inputRowsPerSecond | Rate of incoming data | Stable or growing |
| processedRowsPerSecond | Rate of processing | >= inputRowsPerSecond |
| numInputRows | Records in current batch | Non-zero during active streams |
| batchDuration | Time to process each batch | Below trigger interval |
```python
# Monitor streaming query progress
# (sink table and checkpoint path are placeholders)
query = (df.writeStream
           .option("checkpointLocation", checkpoint_path)
           .toTable("my_catalog.my_schema.events_sink"))

# lastProgress is None until the first micro-batch completes
progress = query.lastProgress
if progress:
    print(f"Input rate: {progress['inputRowsPerSecond']}")
    print(f"Processing rate: {progress['processedRowsPerSecond']}")
    print(f"Batch duration: {progress['batchDuration']} ms")
```
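The healthy ranges in the table can be checked programmatically against a `lastProgress`-style dict. A minimal sketch; the `stream_health` helper, its rules, and the sample values are illustrative.

```python
def stream_health(progress, trigger_interval_ms):
    """Evaluate the health rules from the metrics table against a
    progress dict whose keys mirror StreamingQueryProgress."""
    issues = []
    if progress["processedRowsPerSecond"] < progress["inputRowsPerSecond"]:
        issues.append("falling behind: processing rate below input rate")
    if progress["batchDuration"] > trigger_interval_ms:
        issues.append("batch duration exceeds trigger interval")
    return issues

# Hypothetical progress snapshot for a stream that is falling behind
p = {"inputRowsPerSecond": 500.0, "processedRowsPerSecond": 350.0,
     "numInputRows": 5000, "batchDuration": 12000}
print(stream_health(p, trigger_interval_ms=10_000))
```

An empty result means the stream is keeping up; any entries are candidates for an alert.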
On the Exam: Know how to set up SQL Alerts for data freshness monitoring, understand job notification triggers, and be familiar with the Lakeflow pipeline event log for troubleshooting data quality issues.
A data engineer wants to be notified if no new data arrives in a Delta table for more than 2 hours. Which Databricks feature should they use?
In Databricks Workflows, which notification trigger fires when a job run takes longer than expected?
How can a data engineer pass error information from a failed notebook task to downstream error-handling tasks?