4.1 Databricks Workflows and Lakeflow Jobs
Key Takeaways
- Lakeflow Jobs (Databricks Workflows) orchestrate multi-task data pipelines using a DAG (Directed Acyclic Graph) of tasks with dependencies.
- Task types include notebooks, SQL queries, Python scripts, Lakeflow Declarative Pipelines, dbt, JAR files, and "Run Job" for modular orchestration.
- Jobs can be triggered on a schedule (cron), manually, via API, or by file arrival events.
- Job clusters are ephemeral compute resources created for a job run and terminated afterward, reducing cost compared to all-purpose clusters.
- Task parameters, conditional tasks (if/else), and for-each loops enable dynamic and flexible pipeline orchestration.
Databricks Workflows and Lakeflow Jobs
Quick Answer: Lakeflow Jobs orchestrate multi-task pipelines as DAGs. Tasks can be notebooks, SQL, Python, DLT pipelines, or other jobs. Triggers include scheduled (cron), file arrival, and API. Use job clusters for cost-efficient automated workloads.
Core Concepts
| Concept | Definition |
|---|---|
| Job | A named unit of orchestration containing one or more tasks |
| Task | A single unit of work (notebook, SQL query, DLT pipeline, etc.) |
| Run | A specific execution instance of a job |
| Trigger | The mechanism that starts a job run (schedule, event, API) |
| DAG | The directed acyclic graph of task dependencies |
Task Types
| Task Type | Description | Use Case |
|---|---|---|
| Notebook | Run a Databricks notebook | Custom ETL logic |
| SQL | Run a SQL query or dashboard | Data transformations, quality checks |
| Python script | Run a Python file | Utility scripts, API calls |
| Lakeflow Pipeline | Trigger a Declarative Pipeline update | DLT pipeline execution |
| dbt | Run dbt transformations | dbt Core integration |
| JAR | Run a Java/Scala JAR | Legacy Spark applications |
| Run Job | Trigger another job as a task | Modular, reusable workflows |
| If/Else | Conditional branching | Dynamic pipeline logic |
| For Each | Loop over a list of values | Parameterized batch processing |
Task Dependencies
Job: Daily ETL Pipeline
├── Task 1: Ingest raw data (notebook)
├── Task 2: Run DLT pipeline (depends on Task 1)
└── Task 3: Run quality checks (depends on Task 2)
    ├── Task 4a: Update dashboards (depends on Task 3, if checks pass)
    └── Task 4b: Send alert (depends on Task 3, if checks fail)
Dependencies are defined between tasks:
- A task runs only after all its dependencies complete successfully
- Tasks without dependencies on each other run in parallel
- Failed tasks can be configured to retry, skip downstream, or fail the job
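The DAG above can be sketched as a Jobs API 2.1-style job definition, where each task declares its upstream tasks via `depends_on`. This is a minimal sketch: the task keys, notebook paths, and pipeline ID placeholder are hypothetical, not taken from a real workspace.

```python
# Hedged sketch of a Jobs API 2.1-style job definition for the DAG above.
# Task keys, notebook paths, and the pipeline ID are hypothetical.
job_definition = {
    "name": "daily-etl-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Pipelines/ingest_raw"},
        },
        {
            "task_key": "dlt_update",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        },
        {
            "task_key": "quality_checks",
            "depends_on": [{"task_key": "dlt_update"}],
            "notebook_task": {"notebook_path": "/Pipelines/quality_checks"},
        },
    ],
}

# Tasks with no dependency edge between them are eligible to run in parallel;
# here only "ingest" has no upstream dependencies.
root_tasks = [t["task_key"] for t in job_definition["tasks"] if not t.get("depends_on")]
print(root_tasks)  # ['ingest']
```

Branching tasks such as 4a/4b would additionally set a "Run if" condition on the dependency, so only one branch executes depending on the outcome of the quality checks.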
Triggers and Scheduling
Scheduled Triggers (Cron)
Databricks job schedules use Quartz cron syntax, which has six fields and begins with a seconds field.
# Run every day at 2 AM UTC
Schedule: 0 0 2 * * ?
# Run every Monday at 6 AM UTC
Schedule: 0 0 6 ? * MON
# Run every hour
Schedule: 0 0 * * * ?
File Arrival Trigger
# Trigger job when new files arrive in a cloud storage location
Trigger type: File arrival
Storage location: s3://my-bucket/data/incoming/
Wait time: 60 seconds (how long to wait after last file arrival)
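In a Jobs API 2.1 payload, a file arrival trigger lives under the job's `trigger` field. The sketch below mirrors the example above (same bucket path and 60-second wait); treat the exact field names as assumptions to verify against the current API reference.

```python
# Hedged sketch of a file arrival trigger in a Jobs API 2.1-style job
# settings payload. The storage URL is the example location from the text.
trigger_settings = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            "url": "s3://my-bucket/data/incoming/",
            # Fire only after the location has been quiet for 60 seconds,
            # so a batch of files produces a single run.
            "wait_after_last_change_seconds": 60,
        },
    }
}
print(trigger_settings["trigger"]["file_arrival"]["url"])
```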
Continuous Trigger
# Immediately start a new run when the previous one completes
Trigger type: Continuous
Manual and API Triggers
- UI: Click "Run Now" in the Workflows interface
- CLI: databricks jobs run-now --job-id 123
- REST API: POST /api/2.1/jobs/run-now
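A run-now request carries the job ID and, optionally, parameter overrides for that run. A minimal sketch of the request body, assuming the job ID 123 from the CLI example and reusing the `start_date` parameter shown later in this section (the workspace host is a placeholder):

```python
import json

# Hedged sketch of a REST run-now request body. The workspace host is a
# placeholder; job_id 123 matches the CLI example above.
host = "https://<workspace-host>"
endpoint = f"{host}/api/2.1/jobs/run-now"

payload = {
    "job_id": 123,
    # Optional: override job parameters for this specific run
    "job_parameters": {"start_date": "2024-01-01"},
}
body = json.dumps(payload)
print(endpoint)
```

An actual call would POST this body with a bearer token in the `Authorization` header; the Databricks CLI and SDK wrap the same endpoint.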
Compute Configuration
Job Clusters (Recommended for Production)
- Created when the job starts, terminated when it finishes
- No cost when the job is not running
- Can be shared across tasks within the same job
- Configurable: node types, autoscaling, Spark version, libraries
All-Purpose Clusters
- Persistent clusters for interactive development
- More expensive for production jobs (paying for idle time)
- Use only when multiple jobs need to share state or libraries
Serverless Compute
- Databricks manages the infrastructure
- Fastest startup time (no cluster provisioning delay)
- Best for SQL-heavy and lightweight Python workloads
Task Parameters
# Access task parameters in a notebook
dbutils.widgets.get("start_date") # Widget-based parameter
dbutils.jobs.taskValues.get("previous_task", "row_count") # Task value from previous task
Passing Values Between Tasks
# Task 1: Set a task value
row_count = df.count()
dbutils.jobs.taskValues.set(key="row_count", value=row_count)
# Task 2: Read the task value from Task 1
count = dbutils.jobs.taskValues.get(taskKey="ingest_task", key="row_count")
if count == 0:
    dbutils.notebook.exit("No new data to process")
Error Handling and Retries
| Setting | Description |
|---|---|
| Max retries | Number of times to retry a failed task (0 = no retry) |
| Min retry interval | Minimum wait time between retries |
| Timeout | Maximum time a task can run before being killed |
| On failure | Whether downstream tasks still run, controlled by each dependent task's "Run if" condition (e.g., all succeeded, all done, at least one failed) |
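The settings in the table map onto per-task fields in a Jobs API 2.1 payload. A hedged sketch with illustrative values (the task key is hypothetical; verify field names against the current API reference):

```python
# Hedged sketch of retry/timeout settings on a single task in a
# Jobs API 2.1-style payload. The task key is hypothetical.
task_settings = {
    "task_key": "ingest",
    "max_retries": 3,                     # retry a failed task up to 3 times
    "min_retry_interval_millis": 60_000,  # wait at least 1 minute between retries
    "timeout_seconds": 3600,              # kill the task if it runs over 1 hour
    "retry_on_timeout": True,             # treat a timeout as a retryable failure
}
print(task_settings["max_retries"])
```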
Notification Alerts
Jobs can send notifications on:
- Job start — when a run begins
- Job success — when all tasks complete successfully
- Job failure — when any task fails (after retries)
- Duration threshold — when a run exceeds expected time
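These notification events map onto job-level settings in a Jobs API 2.1-style payload; the duration threshold is expressed as a health rule. A hedged sketch (the email address and 3600-second threshold are placeholders):

```python
# Hedged sketch of job-level notification settings. The recipient address
# and the duration threshold are illustrative placeholders.
notifications = {
    "email_notifications": {
        "on_start": [],
        "on_success": ["data-team@example.com"],
        "on_failure": ["data-team@example.com"],   # sent after retries are exhausted
        "on_duration_warning_threshold_exceeded": ["data-team@example.com"],
    },
    # The duration threshold that drives the warning notification above
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}
        ]
    },
}
print(sorted(notifications["email_notifications"]))
```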
On the Exam: Know the different task types, how to configure job clusters vs. all-purpose clusters, how task values pass data between tasks, and common trigger types (schedule, file arrival, continuous).
Which compute option is most cost-effective for running automated production jobs in Databricks?
How can a downstream task in a Databricks job access a value computed by an upstream task?
A data engineer wants a job to run automatically whenever new files arrive in a cloud storage bucket. Which trigger type should they configure?
What is the purpose of the "Run Job" task type in Databricks Workflows?