4.1 Databricks Workflows and Lakeflow Jobs

Key Takeaways

  • Lakeflow Jobs (Databricks Workflows) orchestrate multi-task data pipelines using a DAG (Directed Acyclic Graph) of tasks with dependencies.
  • Task types include notebooks, SQL queries, Python scripts, Lakeflow Declarative Pipelines, dbt, JAR files, and "Run Job" for modular orchestration.
  • Jobs can be triggered on a schedule (cron), manually, via API, or by file arrival events.
  • Job clusters are ephemeral compute resources created for a job run and terminated afterward, reducing cost compared to all-purpose clusters.
  • Task parameters, conditional tasks (if/else), and for-each loops enable dynamic and flexible pipeline orchestration.
Last updated: March 2026

Databricks Workflows and Lakeflow Jobs

Quick Answer: Lakeflow Jobs orchestrate multi-task pipelines as DAGs. Tasks can be notebooks, SQL, Python, DLT pipelines, or other jobs. Triggers include scheduled (cron), file arrival, and API. Use job clusters for cost-efficient automated workloads.

Core Concepts

| Concept | Definition |
| --- | --- |
| Job | A named unit of orchestration containing one or more tasks |
| Task | A single unit of work (notebook, SQL query, DLT pipeline, etc.) |
| Run | A specific execution instance of a job |
| Trigger | The mechanism that starts a job run (schedule, event, API) |
| DAG | The directed acyclic graph of task dependencies |

Task Types

| Task Type | Description | Use Case |
| --- | --- | --- |
| Notebook | Run a Databricks notebook | Custom ETL logic |
| SQL | Run a SQL query or dashboard | Data transformations, quality checks |
| Python script | Run a Python file | Utility scripts, API calls |
| Lakeflow Pipeline | Trigger a Declarative Pipeline update | DLT pipeline execution |
| dbt | Run dbt transformations | dbt Core integration |
| JAR | Run a Java/Scala JAR | Legacy Spark applications |
| Run Job | Trigger another job as a task | Modular, reusable workflows |
| If/Else | Conditional branching | Dynamic pipeline logic |
| For Each | Loop over a list of values | Parameterized batch processing |
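As a sketch of how the control-flow task types look in a Jobs API 2.1 payload (field names are based on the public API, but treat the exact shape as an assumption to verify against the docs; task keys and paths are illustrative):

```python
# Hypothetical Jobs API 2.1 task fragments (shape assumed, values illustrative).
# An If/Else task compares two operands and gates the downstream branches:
condition_task = {
    "task_key": "check_row_count",
    "condition_task": {
        "op": "GREATER_THAN",                              # comparison operator
        "left": "{{tasks.ingest_task.values.row_count}}",  # dynamic value reference
        "right": "0",
    },
}

# A For Each task runs a nested task once per input value:
for_each_task = {
    "task_key": "process_regions",
    "for_each_task": {
        "inputs": '["us-east", "us-west", "eu-central"]',
        "concurrency": 2,                                  # iterations run in parallel
        "task": {
            "task_key": "process_one_region",
            "notebook_task": {
                "notebook_path": "/Repos/etl/process_region",
                "base_parameters": {"region": "{{input}}"},
            },
        },
    },
}
```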

Task Dependencies

Job: Daily ETL Pipeline
├── Task 1: Ingest raw data (notebook)
├── Task 2: Run DLT pipeline (depends on Task 1)
└── Task 3: Run quality checks (depends on Task 2)
    ├── Task 4a: Update dashboards (depends on Task 3, if checks pass)
    └── Task 4b: Send alert (depends on Task 3, if checks fail)

Dependencies are defined between tasks:

  • A task runs only after all its dependencies complete successfully
  • Tasks without dependencies on each other run in parallel
  • Failed tasks can be configured to retry, skip downstream, or fail the job
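The dependency rules above map directly onto the `depends_on` and `run_if` fields of a Jobs API 2.1 task definition. A minimal sketch (task keys and notebook paths are illustrative assumptions):

```python
# Illustrative Jobs API 2.1 task list with dependencies (paths are made up):
tasks = [
    {"task_key": "ingest",
     "notebook_task": {"notebook_path": "/Repos/etl/ingest"}},
    {"task_key": "quality_checks",
     "depends_on": [{"task_key": "ingest"}],   # runs only after ingest succeeds
     "notebook_task": {"notebook_path": "/Repos/etl/checks"}},
    {"task_key": "alert_on_failure",
     "depends_on": [{"task_key": "quality_checks"}],
     "run_if": "AT_LEAST_ONE_FAILED",          # failure branch instead of default ALL_SUCCESS
     "notebook_task": {"notebook_path": "/Repos/etl/alert"}},
]
```

Tasks with no `depends_on` entry (like `ingest`) start immediately and run in parallel with any other unblocked tasks.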

Triggers and Scheduling

Scheduled Triggers (Cron)

# Databricks job schedules use Quartz cron syntax (a seconds field comes first)

# Run every day at 2 AM UTC
Schedule: 0 0 2 * * ?

# Run every Monday at 6 AM UTC
Schedule: 0 0 6 ? * MON

# Run every hour
Schedule: 0 0 * * * ?

File Arrival Trigger

# Trigger job when new files arrive in a cloud storage location
Trigger type: File arrival
Storage location: s3://my-bucket/data/incoming/
Wait time: 60 seconds (how long to wait after last file arrival)
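In a job definition, the settings above correspond roughly to a trigger block like the following (field names assumed from the Jobs API; the storage URL is illustrative):

```python
# Sketch of a file-arrival trigger in a job definition (shape assumed):
trigger = {
    "file_arrival": {
        "url": "s3://my-bucket/data/incoming/",
        # wait this long after the last new file before starting a run,
        # so a burst of arriving files produces one run instead of many
        "wait_after_last_change_seconds": 60,
    },
    "pause_status": "UNPAUSED",
}
```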

Continuous Trigger

# Immediately start a new run when the previous one completes
Trigger type: Continuous

Manual and API Triggers

  • UI: Click "Run Now" in the Workflows interface
  • CLI: databricks jobs run-now --job-id 123
  • REST API: POST /api/2.1/jobs/run-now

Compute Configuration

Job Clusters (Recommended for Production)

  • Created when the job starts, terminated when it finishes
  • No cost when the job is not running
  • Can be shared across tasks within the same job
  • Configurable: node types, autoscaling, Spark version, libraries
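A shared job cluster might be declared once and referenced by multiple tasks, along these lines (DBR version and node type are cloud-specific and illustrative; verify field names against the Jobs API docs):

```python
# Sketch of a shared job cluster definition (values illustrative):
job_clusters = [{
    "job_cluster_key": "etl_cluster",
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",   # illustrative DBR version
        "node_type_id": "i3.xlarge",           # AWS example node type
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
}]

# Tasks reference the shared cluster by key instead of defining their own:
task = {
    "task_key": "ingest",
    "job_cluster_key": "etl_cluster",
    "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
}
```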

All-Purpose Clusters

  • Persistent clusters for interactive development
  • More expensive for production jobs (paying for idle time)
  • Use only when multiple jobs need to share state or libraries

Serverless Compute

  • Databricks manages the infrastructure
  • Fastest startup time (no cluster provisioning delay)
  • Best for SQL-heavy and lightweight Python workloads

Task Parameters

# Access task parameters in a notebook
dbutils.widgets.get("start_date")  # Widget-based parameter
dbutils.jobs.taskValues.get("previous_task", "row_count")  # Task value from previous task

Passing Values Between Tasks

# Task 1: Set a task value
row_count = df.count()
dbutils.jobs.taskValues.set(key="row_count", value=row_count)

# Task 2: Read the task value from Task 1
count = dbutils.jobs.taskValues.get(taskKey="ingest_task", key="row_count")
if count == 0:
    dbutils.notebook.exit("No new data to process")
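Task values can also be consumed without any code, by embedding a dynamic value reference in a downstream task's parameters (reference syntax from the Databricks docs; task and key names are illustrative):

```python
# Downstream task parameters can embed the upstream task value directly
# via a {{tasks.<task_key>.values.<key>}} dynamic value reference:
downstream_task = {
    "task_key": "report",
    "depends_on": [{"task_key": "ingest_task"}],
    "notebook_task": {
        "notebook_path": "/Repos/etl/report",
        "base_parameters": {"rows": "{{tasks.ingest_task.values.row_count}}"},
    },
}
```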

Error Handling and Retries

| Setting | Description |
| --- | --- |
| Max retries | Number of times to retry a failed task (0 = no retry) |
| Min retry interval | Minimum wait time between retries |
| Timeout | Maximum time a task can run before being killed |
| On failure | Stop the job, continue running unaffected tasks, or skip downstream tasks |
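These settings map onto per-task fields in a Jobs API payload roughly as follows (field names from the public API; values are illustrative assumptions):

```python
# Sketch of per-task retry and timeout settings (values illustrative):
task = {
    "task_key": "ingest",
    "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
    "max_retries": 3,                      # retry a failed run up to 3 times
    "min_retry_interval_millis": 60_000,   # wait at least 1 minute between retries
    "retry_on_timeout": False,             # do not retry if the task timed out
    "timeout_seconds": 3600,               # kill the task after 1 hour
}
```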

Notification Alerts

Jobs can send notifications on:

  • Job start — when a run begins
  • Job success — when all tasks complete successfully
  • Job failure — when any task fails (after retries)
  • Duration threshold — when a run exceeds expected time

On the Exam: Know the different task types, how to configure job clusters vs. all-purpose clusters, how task values pass data between tasks, and common trigger types (schedule, file arrival, continuous).

[Diagram: Lakeflow Job DAG Example]
Test Your Knowledge

  • Which compute option is most cost-effective for running automated production jobs in Databricks?
  • How can a downstream task in a Databricks job access a value computed by an upstream task?
  • A data engineer wants a job to run automatically whenever new files arrive in a cloud storage bucket. Which trigger type should they configure?
  • What is the purpose of the "Run Job" task type in Databricks Workflows?
D