4.3 Pipeline Monitoring, Alerting, and Error Handling

Key Takeaways

  • Lakeflow Declarative Pipeline UI shows run history, dataset lineage, data quality metrics, and processing durations.
  • Databricks SQL Alerts trigger notifications when a SQL query result meets a specified condition (e.g., row count drops to zero).
  • Job email and webhook notifications can alert teams on job start, success, failure, or duration threshold violations.
  • Structured Streaming metrics (numInputRows, inputRowsPerSecond, processedRowsPerSecond) monitor streaming pipeline health.
  • The event log for Lakeflow Declarative Pipelines stores detailed metrics, expectations results, and lineage information.
Last updated: March 2026


Quick Answer: Monitor pipelines using the Lakeflow UI (run history, data quality metrics), Databricks SQL Alerts (query-based notifications), job notifications (email/webhook on success/failure), and streaming metrics (processing rates, latency). The event log provides detailed pipeline telemetry.

Lakeflow Declarative Pipeline Monitoring

Pipeline UI Dashboard

The pipeline UI provides real-time and historical monitoring:

  • Run history — Status (success/failure), duration, and timestamps for each run
  • Dataset lineage — Visual DAG showing data flow between tables
  • Expectations metrics — Pass/fail counts and percentages for data quality rules
  • Processing details — Records processed, files read, write durations
  • Cluster utilization — Compute resource usage and autoscaling events

Event Log

The event log stores detailed pipeline telemetry:

-- Query the event log for a specific pipeline
SELECT
    timestamp,
    event_type,
    details
FROM event_log(TABLE(my_catalog.my_schema.my_pipeline))
WHERE event_type IN ('flow_progress', 'dataset_definition', 'update_progress')
ORDER BY timestamp DESC;

Expectation Monitoring

-- Monitor data quality metrics from the event log.
-- Expectation results are nested under data_quality as a JSON array,
-- so parse and explode them into one row per expectation.
SELECT
    row.dataset AS dataset,
    row.name AS expectation_name,
    row.passed_records AS passed,
    row.failed_records AS failed
FROM (
    SELECT explode(
        from_json(
            details:flow_progress.data_quality.expectations,
            'array<struct<name STRING, dataset STRING, passed_records BIGINT, failed_records BIGINT>>'
        )
    ) AS row
    FROM event_log(TABLE(my_catalog.my_schema.my_pipeline))
    WHERE event_type = 'flow_progress'
        AND details:flow_progress.data_quality.expectations IS NOT NULL
);
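Once expectation counts are extracted, they can be turned into pass-rate alerts. A minimal sketch in plain Python, assuming the query results have been collected into dictionaries (the helper name and the 95% threshold are illustrative, not from the source):

```python
def failing_expectations(rows, threshold=0.95):
    """Return expectations whose pass rate falls below the threshold.

    Each row is a dict with keys: dataset, expectation_name, passed, failed
    (mirroring the event-log query columns).
    """
    alerts = []
    for row in rows:
        total = row["passed"] + row["failed"]
        if total == 0:
            continue  # no records evaluated; nothing to report
        pass_rate = row["passed"] / total
        if pass_rate < threshold:
            alerts.append((row["dataset"], row["expectation_name"], pass_rate))
    return alerts

rows = [
    {"dataset": "silver_orders", "expectation_name": "valid_id", "passed": 990, "failed": 10},
    {"dataset": "silver_orders", "expectation_name": "valid_ts", "passed": 600, "failed": 400},
]
print(failing_expectations(rows))  # flags valid_ts (60% pass rate)
```

The same threshold logic could just as well live in the SQL Alert query itself; the Python form is useful when alerting is driven from a notebook task.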

Databricks SQL Alerts

SQL Alerts monitor query results and send notifications:

Creating an Alert

  1. Write a SQL query that returns a value to monitor
  2. Set a trigger condition (e.g., value > threshold, row count = 0)
  3. Configure the check schedule (how often to run the query)
  4. Set notification destinations (email, Slack, PagerDuty, webhook)
Example alert query for data freshness:

-- Alert query: detect if no new data arrived in the last hour
SELECT COUNT(*) AS new_records
FROM my_catalog.my_schema.raw_events
WHERE _ingest_timestamp >= current_timestamp() - INTERVAL 1 HOUR;

-- Alert triggers when new_records = 0

Common Alert Scenarios

  • Data freshness — No new records in the last N hours
  • Data quality — Null rate exceeds threshold
  • Volume anomaly — Record count deviates from expected range
  • Pipeline delay — Processing lag exceeds SLA
  • Cost monitoring — Daily compute cost exceeds budget
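The volume-anomaly scenario can be sketched in plain Python as well. A minimal example, assuming daily record counts are already available as a list (the ±50% deviation band is an arbitrary illustration, not a recommended value):

```python
def is_volume_anomaly(history, today, band=0.5):
    """Flag today's record count if it deviates from the trailing
    average by more than the given fractional band."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    lower = baseline * (1 - band)
    upper = baseline * (1 + band)
    return not (lower <= today <= upper)

# Trailing seven days of counts, then today's count
history = [1000, 1100, 950, 1020, 980, 1050, 990]
print(is_volume_anomaly(history, 120))   # True: far below baseline
print(is_volume_anomaly(history, 1010))  # False: within the band
```

In practice the trailing counts would come from a scheduled query against the target table, and the boolean result would drive the SQL Alert condition.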

Job Notifications

Notification Triggers

  • On start — When a job run begins
  • On success — When all tasks complete successfully
  • On failure — When any task fails (after retries)
  • On duration warning — When run time exceeds a threshold

Notification Destinations

  • Email — Send to one or more email addresses
  • Slack — Post to a Slack channel via webhook
  • PagerDuty — Create incidents for critical failures
  • Webhook — Send to any HTTP endpoint
  • Microsoft Teams — Post to a Teams channel
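These triggers and destinations map onto fields in a job settings payload. A sketch of the relevant fragment as a Python dict, with field names based on the Jobs API 2.1 request shape (verify names and the destination IDs against the current API reference; the addresses and ID are placeholders):

```python
# Notification-related fragment of a Jobs API job settings payload.
# Field names follow Jobs API 2.1 conventions -- treat as a sketch,
# not an authoritative schema.
job_notifications = {
    "email_notifications": {
        "on_start": [],
        "on_success": ["data-team@example.com"],
        "on_failure": ["oncall@example.com"],
    },
    "webhook_notifications": {
        # IDs reference notification destinations configured in the workspace
        "on_failure": [{"id": "slack-destination-id"}],
    },
    # Duration-warning trigger: fire when a run exceeds one hour
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}
        ]
    },
}
```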

Error Handling Best Practices

Retry Configuration

Task retry settings:
- Max retries: 2 (retry twice before failing)
- Min retry interval: 60 seconds (wait 1 minute between retries)
- Retry on timeout: true (retry if the task times out)
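The same retry semantics can be reproduced inside a task for finer-grained control over a single step. A minimal sketch, assuming `run_step` is the unit of work to protect (the function names are illustrative):

```python
import time

def run_with_retries(run_step, max_retries=2, min_retry_interval=60):
    """Run a callable, retrying up to max_retries times with a fixed
    wait between attempts -- mirroring the job-level settings above."""
    attempts = max_retries + 1  # the initial attempt plus the retries
    for attempt in range(1, attempts + 1):
        try:
            return run_step()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted; surface the failure
            time.sleep(min_retry_interval)

# Example: a step that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky_step, max_retries=2, min_retry_interval=0))  # → ok
```

Job-level retries restart the whole task; an in-task loop like this only repeats the protected step, which is cheaper when setup work (cluster attach, data load) is expensive.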

Handling Failures in Notebooks

# Return a status value that downstream tasks can inspect.
# Note: dbutils.notebook.exit() always ends the notebook run as
# "succeeded" -- re-raise the exception instead if the task itself
# should be marked failed (and trigger configured retries).
try:
    # Processing logic
    result = process_data()
    dbutils.notebook.exit("SUCCESS")
except Exception as e:
    # Log the error, then surface it in the exit value
    print(f"Error: {e}")
    dbutils.notebook.exit(f"FAILURE: {e}")

Dead-Letter Pattern

# Route bad records to a dead-letter table instead of failing the run
try:
    process_record(record)
except Exception as e:
    # Write the record and error details to a dead-letter table
    # for manual investigation
    write_to_dead_letter(record, str(e))
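Expanded into a self-contained sketch (pure Python; the record format and the `amount` validation rule are illustrative), the pattern keeps the main flow running while quarantining failures:

```python
def process_batch(records):
    """Process records, routing failures to a dead-letter list
    instead of aborting the whole batch."""
    processed, dead_letter = [], []
    for record in records:
        try:
            # Illustrative processing: records must carry a numeric "amount"
            processed.append({**record, "amount": float(record["amount"])})
        except (KeyError, TypeError, ValueError) as e:
            # Capture the record and the error details for investigation
            dead_letter.append({"record": record, "error": repr(e)})
    return processed, dead_letter

good, bad = process_batch([
    {"id": 1, "amount": "10.5"},
    {"id": 2},                    # missing amount -> dead letter
    {"id": 3, "amount": "oops"},  # unparsable amount -> dead letter
])
print(len(good), len(bad))  # → 1 2
```

In a real pipeline `dead_letter` would be written to a Delta table with an error column and a timestamp, so the bad records can be replayed after the upstream issue is fixed.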

Streaming Monitoring

Key Streaming Metrics

  • inputRowsPerSecond — Rate of incoming data (healthy: stable or growing)
  • processedRowsPerSecond — Rate of processing (healthy: at or above inputRowsPerSecond)
  • numInputRows — Records in the current batch (healthy: non-zero during active streams)
  • batchDuration — Time to process each batch (healthy: below the trigger interval)

# Monitor streaming query progress
query = df.writeStream.format("console").start()

# lastProgress is None until the first micro-batch completes
progress = query.lastProgress
if progress:
    print(f"Input rate: {progress['inputRowsPerSecond']}")
    print(f"Processing rate: {progress['processedRowsPerSecond']}")
    print(f"Batch duration: {progress['batchDuration']} ms")
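These metrics can also be checked programmatically. A minimal sketch, assuming `progress` is the dictionary form of the latest progress report (the 10% backlog tolerance is an illustrative choice):

```python
def is_falling_behind(progress, tolerance=0.10):
    """Return True when the processing rate lags the input rate by
    more than the tolerance -- a sign the stream is building backlog."""
    input_rate = progress.get("inputRowsPerSecond", 0.0)
    processed_rate = progress.get("processedRowsPerSecond", 0.0)
    if input_rate == 0:
        return False  # idle stream; nothing to fall behind on
    return processed_rate < input_rate * (1 - tolerance)

print(is_falling_behind({"inputRowsPerSecond": 100.0, "processedRowsPerSecond": 60.0}))  # → True
print(is_falling_behind({"inputRowsPerSecond": 100.0, "processedRowsPerSecond": 98.0}))  # → False
```

A check like this could run on a schedule (or in a StreamingQueryListener) and feed a webhook notification when the stream falls behind.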

On the Exam: Know how to set up SQL Alerts for data freshness monitoring, understand job notification triggers, and be familiar with the Lakeflow pipeline event log for troubleshooting data quality issues.

Test Your Knowledge

A data engineer wants to be notified if no new data arrives in a Delta table for more than 2 hours. Which Databricks feature should they use?

Test Your Knowledge

In Databricks Workflows, which notification trigger fires when a job run takes longer than expected?

Test Your Knowledge

How can a data engineer pass error information from a failed notebook task to downstream error-handling tasks?
