4.3 Pipeline Monitoring, Alerting, and Error Handling
Key Takeaways
- Lakeflow Declarative Pipeline UI shows run history, dataset lineage, data quality metrics, and processing durations.
- Databricks SQL Alerts trigger notifications when a SQL query result meets a specified condition (e.g., row count drops to zero).
- Job email and webhook notifications can alert teams on job start, success, failure, or duration threshold violations.
- Structured Streaming metrics (numInputRows, inputRowsPerSecond, processedRowsPerSecond) monitor streaming pipeline health.
- The event log for Lakeflow Declarative Pipelines stores detailed metrics, expectations results, and lineage information.
Pipeline Monitoring, Alerting, and Error Handling
Quick Answer: Monitor pipelines using the Lakeflow UI (run history, data quality metrics), Databricks SQL Alerts (query-based notifications), job notifications (email/webhook on success/failure), and streaming metrics (processing rates, latency). The event log provides detailed pipeline telemetry.
Lakeflow Declarative Pipeline Monitoring
Pipeline UI Dashboard
The pipeline UI provides real-time and historical monitoring:
| Metric | Description |
|---|---|
| Run history | Status (success/failure), duration, and timestamps for each run |
| Dataset lineage | Visual DAG showing data flow between tables |
| Expectations metrics | Pass/fail counts and percentages for data quality rules |
| Processing details | Records processed, files read, write durations |
| Cluster utilization | Compute resource usage and autoscaling events |
Event Log
The event log stores detailed pipeline telemetry:
```sql
-- Query the event log for a specific pipeline.
-- TABLE() takes a streaming table the pipeline updates; a pipeline ID
-- string can be passed instead.
SELECT
  timestamp,
  event_type,
  details
FROM event_log(TABLE(my_catalog.my_schema.my_pipeline))
WHERE event_type IN ('flow_progress', 'flow_definition', 'dataset_definition')
ORDER BY timestamp DESC;
```
Expectation Monitoring
```sql
-- Monitor data quality metrics from the event log.
-- Expectation results are stored as a JSON array inside
-- flow_progress events, so they must be parsed and exploded.
SELECT
  exp.dataset AS dataset,
  exp.name AS expectation_name,
  SUM(exp.passed_records) AS passed,
  SUM(exp.failed_records) AS failed
FROM (
  SELECT explode(
    from_json(
      details:flow_progress:data_quality:expectations,
      'array<struct<name: string, dataset: string, passed_records: bigint, failed_records: bigint>>'
    )
  ) AS exp
  FROM event_log(TABLE(my_catalog.my_schema.my_pipeline))
  WHERE event_type = 'flow_progress'
    AND details:flow_progress:data_quality:expectations IS NOT NULL
)
GROUP BY exp.dataset, exp.name;
```
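Outside of SQL, the same expectation payload can be summarized in Python. This is a minimal sketch assuming a simplified `flow_progress` event shape; the sample values and the `expectation_summary` helper are illustrative, not part of the Databricks API.

```python
import json

def expectation_summary(event_details_json):
    """Summarize expectation results from a flow_progress event's
    `details` payload (simplified shape mirroring the event log)."""
    details = json.loads(event_details_json)
    rows = []
    for exp in details["flow_progress"]["data_quality"]["expectations"]:
        total = exp["passed_records"] + exp["failed_records"]
        pass_rate = exp["passed_records"] / total if total else None
        rows.append((exp["dataset"], exp["name"], pass_rate))
    return rows

# Hypothetical flow_progress payload with one expectation result
sample = json.dumps({
    "flow_progress": {
        "data_quality": {
            "expectations": [
                {"dataset": "clean_events", "name": "valid_id",
                 "passed_records": 98, "failed_records": 2}
            ]
        }
    }
})
print(expectation_summary(sample))  # [('clean_events', 'valid_id', 0.98)]
```

A pass rate like this can feed a SQL Alert or a dashboard tile directly.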
Databricks SQL Alerts
SQL Alerts monitor query results and send notifications:
Creating an Alert
- Write a SQL query that returns a value to monitor
- Set a trigger condition (e.g., value > threshold, row count = 0)
- Configure the check schedule (how often to run the query)
- Set notification destinations (email, Slack, PagerDuty, webhook)
```sql
-- Alert query: detect if no new data arrived in the last hour
SELECT COUNT(*) AS new_records
FROM my_catalog.my_schema.raw_events
WHERE _ingest_timestamp >= current_timestamp() - INTERVAL 1 HOUR;
-- Alert triggers when new_records = 0
```
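The trigger condition the alert evaluates can be expressed as a small predicate. This sketch mirrors the `new_records = 0` freshness check above; the `freshness_alert` function and its timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone

def freshness_alert(latest_ingest_ts, max_age=timedelta(hours=1), now=None):
    """Return True when the table looks stale: no record has arrived
    within max_age. Mirrors the SQL alert condition new_records = 0."""
    now = now or datetime.now(timezone.utc)
    return latest_ingest_ts is None or (now - latest_ingest_ts) > max_age

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=30)
stale = now - timedelta(hours=3)
print(freshness_alert(fresh, now=now))  # False: data arrived recently
print(freshness_alert(stale, now=now))  # True: would trigger the alert
```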
Common Alert Scenarios
| Scenario | Alert Condition |
|---|---|
| Data freshness | No new records in the last N hours |
| Data quality | Null rate exceeds threshold |
| Volume anomaly | Record count deviates from expected range |
| Pipeline delay | Processing lag exceeds SLA |
| Cost monitoring | Daily compute cost exceeds budget |
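The volume-anomaly scenario in the table reduces to a range check on the daily record count. A minimal sketch, with hypothetical bounds:

```python
def volume_anomaly(count, expected_low, expected_high):
    """Flag a record count outside the expected daily range;
    an alert would fire when this returns True."""
    return not (expected_low <= count <= expected_high)

# Hypothetical daily ingest bounds of 80k-150k records
print(volume_anomaly(120_000, 80_000, 150_000))  # False: within range
print(volume_anomaly(5_000, 80_000, 150_000))    # True: anomalously low
```

In practice the bounds often come from a rolling average over recent days rather than fixed constants.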
Job Notifications
Notification Triggers
| Trigger | When It Fires |
|---|---|
| On start | When a job run begins |
| On success | When all tasks complete successfully |
| On failure | When any task fails (after retries) |
| On duration warning | When run time exceeds a threshold |
Notification Destinations
- Email — Send to one or more email addresses
- Slack — Post to a Slack channel via webhook
- PagerDuty — Create incidents for critical failures
- Webhook — Send to any HTTP endpoint
- Microsoft Teams — Post to a Teams channel
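As an illustration, the triggers and destinations above map onto a job's settings in the Jobs API JSON. The email addresses, webhook destination ID, and duration threshold below are placeholders:

```json
{
  "email_notifications": {
    "on_start": [],
    "on_success": ["data-team@example.com"],
    "on_failure": ["oncall@example.com"]
  },
  "webhook_notifications": {
    "on_failure": [{"id": "<notification-destination-id>"}]
  },
  "health": {
    "rules": [
      {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}
    ]
  }
}
```

The `health` rule implements the duration-warning trigger: it fires when a run exceeds the configured number of seconds.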
Error Handling Best Practices
Retry Configuration
Task retry settings:
- Max retries: 2 (retry twice before failing)
- Min retry interval: 60 seconds (wait 1 minute between retries)
- Retry on timeout: true (retry if the task times out)
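The retry policy above can be sketched as a loop: attempt the task, and on failure wait the minimum interval and try again until the retry budget is exhausted. The `run_with_retries` helper is illustrative, not a Databricks API.

```python
import time

def run_with_retries(task, max_retries=2, min_retry_interval_s=60,
                     sleep=time.sleep):
    """Run task(), retrying up to max_retries times with a fixed
    wait between attempts; re-raise once the budget is exhausted."""
    attempts = 0
    while True:
        try:
            return task()
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise
            sleep(min_retry_interval_s)

# Example: a task that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(run_with_retries(flaky, sleep=lambda s: None))  # ok
```

In Workflows this logic is configured declaratively per task; the sketch only makes the semantics concrete.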
Handling Failures in Notebooks
```python
# Exit with a status to control downstream task behavior
try:
    # Processing logic
    result = process_data()
    dbutils.notebook.exit("SUCCESS")
except Exception as e:
    # Record the error for downstream tasks, then re-raise so the
    # task is actually marked as failed -- calling
    # dbutils.notebook.exit("FAILURE") alone would report success.
    dbutils.jobs.taskValues.set(key="error_message", value=str(e))
    raise
```
Dead-Letter Pattern
```python
# Route bad records to a dead-letter table instead of failing the run
try:
    process_record(record)
except Exception as e:
    # Write to a dead-letter table for manual investigation,
    # preserving both the record and the error message
    write_to_dead_letter(record, error_details=str(e))
```
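Applied over a batch, the pattern splits the input into successfully processed records and a dead-letter collection. A self-contained sketch, with a plain list standing in for the dead-letter table:

```python
def route_records(records, process):
    """Process records one by one; failures are captured in a
    dead-letter list instead of failing the whole run."""
    processed, dead_letter = [], []
    for record in records:
        try:
            processed.append(process(record))
        except Exception as e:
            dead_letter.append({"record": record, "error": str(e)})
    return processed, dead_letter

# "bad" + 1 raises TypeError, so that record is dead-lettered
ok, dlq = route_records([1, "bad", 3], lambda r: r + 1)
print(ok)   # [2, 4]
print(dlq)  # one entry capturing the "bad" record and its error
```

A periodic SQL Alert on the dead-letter table's row count then surfaces systematic quality problems.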
Streaming Monitoring
Key Streaming Metrics
| Metric | Description | Healthy Range |
|---|---|---|
| inputRowsPerSecond | Rate of incoming data | Stable or growing |
| processedRowsPerSecond | Rate of processing | >= inputRowsPerSecond |
| numInputRows | Records in current batch | Non-zero during active streams |
| batchDuration | Time to process each batch | Below trigger interval |
```python
# Monitor streaming query progress
# (sink table and checkpoint path are placeholders)
query = (df.writeStream
           .option("checkpointLocation", checkpoint_path)
           .toTable("my_catalog.my_schema.events_sink"))

# lastProgress is None until the first micro-batch completes
progress = query.lastProgress
if progress:
    print(f"Input rate: {progress['inputRowsPerSecond']}")
    print(f"Processing rate: {progress['processedRowsPerSecond']}")
    print(f"Batch duration: {progress['batchDuration']} ms")
```
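The healthy ranges in the table can be checked programmatically against a `lastProgress`-style dict. A minimal sketch; the `stream_health` helper, its rules, and the sample values are illustrative.

```python
def stream_health(progress, trigger_interval_ms):
    """Evaluate the health rules from the metrics table against a
    progress dict whose keys mirror StreamingQueryProgress."""
    issues = []
    if progress["processedRowsPerSecond"] < progress["inputRowsPerSecond"]:
        issues.append("falling behind: processing rate below input rate")
    if progress["batchDuration"] > trigger_interval_ms:
        issues.append("batch duration exceeds trigger interval")
    return issues

# Hypothetical progress snapshot for a stream that is falling behind
p = {"inputRowsPerSecond": 500.0, "processedRowsPerSecond": 350.0,
     "numInputRows": 5000, "batchDuration": 12000}
print(stream_health(p, trigger_interval_ms=10_000))
```

An empty result means the stream is keeping up; any entries are candidates for an alert.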
On the Exam: Know how to set up SQL Alerts for data freshness monitoring, understand job notification triggers, and be familiar with the Lakeflow pipeline event log for troubleshooting data quality issues.
A data engineer wants to be notified if no new data arrives in a Delta table for more than 2 hours. Which Databricks feature should they use?
In Databricks Workflows, which notification trigger fires when a job run takes longer than expected?
How can a data engineer pass error information from a failed notebook task to downstream error-handling tasks?