4.1 Databricks Workflows and Lakeflow Jobs
Key Takeaways
- Lakeflow Jobs orchestrate work using three concepts: jobs (the container), tasks (units of work shown as a DAG), and triggers (what starts the job).
- Tasks declare upstream dependencies and use the 'Run if' condition (e.g. All succeeded, At least one succeeded, All done) to control conditional and error-handling branches.
- Triggers can be scheduled (cron/quartz), continuous, file-arrival, or table-update; classic jobs default to no task retries while serverless jobs auto-optimize retries.
- Repair run reruns only the failed and downstream tasks of a multi-task job, reusing successful task output instead of rerunning the whole job.
- Job clusters are created for the run and terminated at the end, making them cheaper and more isolated than shared all-purpose clusters.
Jobs, Tasks, and Triggers
Lakeflow Jobs (the orchestrator formerly branded Databricks Workflows) is the native scheduler for the Databricks Data Intelligence Platform. The exam expects you to know three building blocks:
- A job is the top-level container that coordinates, schedules, and runs work. A job can be a single notebook or hundreds of tasks with branching logic.
- A task is one unit of work: a notebook, Python script, SQL file, Lakeflow Declarative Pipeline, dbt project, JAR, or another job (a Run Job task). Tasks are arranged as a Directed Acyclic Graph (DAG), meaning dependencies flow in one direction and cannot cycle back.
- A trigger is the mechanism that starts a run. It can be time-based (a schedule) or event-based (file arrival in cloud storage, or a Delta table update).
Because the task graph is a DAG, Databricks can run independent tasks in parallel and holds a task back only until its upstream dependencies have finished.
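The parallelism a DAG allows can be illustrated with a small standalone sketch. It does not call Databricks; the task keys and the `depends_on` mapping are hypothetical, and grouping tasks into "waves" is a simplification (a real scheduler can start each task as soon as its own dependencies finish).

```python
from graphlib import TopologicalSorter

# Hypothetical task keys mapped to the upstream tasks they depend on.
depends_on = {
    "ingest_orders": [],
    "ingest_customers": [],
    "join_silver": ["ingest_orders", "ingest_customers"],
    "publish_gold": ["join_silver"],
}

def parallel_waves(graph):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock downstream tasks
    return waves

waves = parallel_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In this sketch the two ingest tasks land in the same wave because neither depends on the other, while the join and publish tasks each wait for the wave before them.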
Dependencies and the Run-if Condition
Each task lists its Depends on upstream tasks. By default a task runs only when all of its upstream dependencies succeed, but the Run if setting changes this so you can build error-handling and fan-in branches:
| Run if value | Task runs when… |
|---|---|
| All succeeded (default) | every upstream dependency succeeded |
| At least one succeeded | one or more upstream tasks succeeded |
| None failed | no upstream failed (skipped is OK) |
| All done | every upstream finished, regardless of outcome |
| At least one failed | one or more upstream failed (cleanup/alert branch) |
| All failed | every upstream dependency failed |
The All done and At least one failed values are how you wire a notification or rollback task that should fire even when an earlier task breaks. Lakeflow Jobs also supports If/else condition tasks and For each tasks to loop a task over a parameter list.
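The table can be restated as a small truth function. This is a simplified model rather than Databricks code: it assumes every upstream task has already finished with one of three hypothetical states (`success`, `failed`, `skipped`), and the condition names follow the upper-case `run_if` spellings as they are understood to appear in the Jobs API, which is worth verifying against the API reference.

```python
def should_run(run_if, upstream_states):
    """Return True if a task with this Run if value would run.

    Simplified: upstream_states holds 'success', 'failed', or 'skipped'
    for upstream tasks that have all finished.
    """
    succeeded = [state == "success" for state in upstream_states]
    failed = [state == "failed" for state in upstream_states]
    rules = {
        "ALL_SUCCESS": all(succeeded),           # default
        "AT_LEAST_ONE_SUCCESS": any(succeeded),
        "NONE_FAILED": not any(failed),          # skipped upstreams are fine
        "ALL_DONE": True,                        # outcome does not matter
        "AT_LEAST_ONE_FAILED": any(failed),      # cleanup or alert branch
        "ALL_FAILED": all(failed),
    }
    return rules[run_if]

# A cleanup task that should fire only when the upstream ETL task fails.
print(should_run("AT_LEAST_ONE_FAILED", ["failed"]))   # True
print(should_run("AT_LEAST_ONE_FAILED", ["success"]))  # False
```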
Triggers, Schedules, and Cron
A job's trigger type determines when it starts:
- Scheduled: a recurring time-based trigger. The UI builds the schedule, but it compiles to a Quartz cron expression (for example `0 0 2 * * ?` for 2:00 AM daily). You also pick a timezone, which matters for daylight-saving correctness.
- Continuous: keeps one run active at all times, restarting automatically; used for always-on streaming. Continuous jobs cannot use a normal retry policy.
- File arrival: polls a Unity Catalog external location or volume path and starts a run when new files land, lowering latency versus a fixed schedule.
- Table update: triggers when a monitored Delta table changes.
If a scheduled run is still going when the next trigger fires, the concurrent runs limit and queueing settings decide whether the new run waits, is skipped, or runs in parallel.
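To make the Quartz field order concrete, the following sketch splits an expression into named fields. The helper `explain_quartz` is hypothetical, the timezone is an arbitrary example, and the `quartz_cron_expression` and `timezone_id` keys reflect the schedule block of the Jobs API as understood here; treat them as assumptions to check against the reference.

```python
# Quartz puts seconds first, unlike five-field Unix cron.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]

def explain_quartz(expression):
    """Map each field of a Quartz cron expression to its name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("Quartz expressions have 6 fields, plus an optional year")
    named = dict(zip(QUARTZ_FIELDS, parts))
    if len(parts) == 7:
        named["year"] = parts[6]
    return named

# Assumed shape of a job's schedule settings; the timezone is illustrative.
schedule = {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "America/New_York",
}

fields = explain_quartz(schedule["quartz_cron_expression"])
print(fields)
```

Reading the result left to right: second 0, minute 0, hour 2, every day of the month, every month, and `?` meaning no specific day of the week, which is 2:00 AM daily in the chosen timezone.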
Clusters, Retries, Notifications, and Repair
Compute choices. A task can run on serverless compute, a dedicated job cluster, or an existing all-purpose cluster. A job cluster spins up for the run and terminates when it ends — cheaper and more isolated, and billed at the lower Jobs Compute DBU rate. An all-purpose cluster stays running and is meant for interactive notebooks; reusing it for scheduled jobs costs more and risks resource contention.
Retries. Per task you set max_retries, a minimum retry interval between attempts (min_retry_interval_millis in the Jobs API), and whether to retry on timeout. On classic compute the default is no retries; serverless jobs auto-optimize retries. A timeout marks a task Timed Out, and if you set both timeout and retries, the timeout applies to each attempt.
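Because the timeout applies to each attempt, a task's total time budget grows with the retry count. The arithmetic below is a rough sketch under stated assumptions: every attempt runs into the timeout, retry on timeout is enabled, and each wait is exactly the minimum interval (the real wait may be longer). The function name and the example numbers are hypothetical.

```python
def timeout_budget_seconds(timeout_seconds, max_retries, min_retry_interval_millis):
    """Estimate wall-clock seconds if every attempt times out.

    Assumes retry on timeout is enabled and each wait equals the minimum
    retry interval, so the true figure can be higher.
    """
    attempts = max_retries + 1  # the first attempt plus each retry
    waiting_seconds = max_retries * min_retry_interval_millis / 1000
    return attempts * timeout_seconds + waiting_seconds

# Example: 10-minute timeout, 2 retries, 60-second minimum interval.
budget = timeout_budget_seconds(
    timeout_seconds=600,
    max_retries=2,
    min_retry_interval_millis=60_000,
)
print(budget)  # 3 attempts * 600 s + 2 waits * 60 s
```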
Notifications send email, Slack, or webhook alerts on start, success, failure, or duration-exceeded.
Repair run is an exam favorite: for a multi-task job, repair reruns only the failed tasks and their downstream tasks, reusing the output of tasks that already succeeded, which is far cheaper than rerunning everything.
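The set of tasks a repair would rerun can be sketched as a graph walk: start from the failed tasks and follow dependency edges downstream. This is an illustration of the idea, not the Databricks implementation; the function name, the task keys, and the assumption that the 12 tasks form a simple linear chain are all hypothetical.

```python
def tasks_to_repair(depends_on, failed):
    """Return the failed tasks plus everything downstream of them."""
    # Invert the dependency map so we can walk from a task to its dependents.
    downstream = {}
    for task, upstream_tasks in depends_on.items():
        for upstream in upstream_tasks:
            downstream.setdefault(upstream, set()).add(task)

    to_rerun = set()
    stack = list(failed)
    while stack:
        task = stack.pop()
        if task in to_rerun:
            continue
        to_rerun.add(task)
        stack.extend(downstream.get(task, ()))
    return to_rerun

# Assumed layout: task_1 -> task_2 -> ... -> task_12, with task_7 failing.
chain = {f"task_{i}": ([f"task_{i - 1}"] if i > 1 else []) for i in range(1, 13)}
rerun = tasks_to_repair(chain, {"task_7"})
print(sorted(rerun, key=lambda name: int(name.split("_")[1])))
```

Under that linear layout, tasks 1 through 6 are left alone and only tasks 7 through 12 are rerun.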
Parameters, Git, and Task Values
Lakeflow Jobs supports job parameters — key/value pairs defined at the job level and automatically pushed down to every task, so one parameter (for example run_date) drives the whole DAG. Tasks read them with dbutils.widgets.get(...) in a notebook or as named arguments for a Python/SQL task. You can override parameters at trigger time, which is how the same job runs a backfill for an arbitrary date.
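The backfill pattern amounts to keeping the job's default parameters and overriding one key per run. The sketch below only builds the override dictionaries; it does not trigger anything, and the parameter names (`run_date`, `target_schema`) and the helper `backfill_overrides` are hypothetical.

```python
from datetime import date, timedelta

def backfill_overrides(job_defaults, start, end):
    """Build one parameter set per day, overriding run_date each time."""
    overrides = []
    day = start
    while day <= end:
        # Later keys win, so run_date replaces the job-level default.
        overrides.append({**job_defaults, "run_date": day.isoformat()})
        day += timedelta(days=1)
    return overrides

# Hypothetical job-level defaults shared by every task in the DAG.
defaults = {"run_date": "2024-01-01", "target_schema": "analytics"}

runs = backfill_overrides(defaults, date(2024, 3, 1), date(2024, 3, 3))
for params in runs:
    print(params)
```

Each dictionary would be supplied as the parameter override for one triggered run, so every task in that run sees the same `run_date`.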
Git integration lets a job run notebooks or scripts directly from a remote Git repository (a specific branch, tag, or commit) instead of the workspace, so production runs exactly the reviewed code. This is the bridge between Jobs and the source-controlled, Asset-Bundle workflow.
Tasks can also pass small results downstream with task values (dbutils.jobs.taskValues.set/get), letting a control task compute a value (such as a row count) that a later If/else condition branches on.
Queueing and Concurrency
When runs overlap, two settings govern behavior: max concurrent runs caps how many runs of the same job execute at once, and queueing holds a triggered run until a slot frees instead of skipping it. For most ETL you set concurrent runs to 1 and enable queueing so a slow run does not drop the next scheduled load. Note that a continuous job ignores schedules and queueing because it always keeps one run alive, restarting on completion or failure. Choosing the right trigger, concurrency, and retry combination per job is the difference between a pipeline that quietly recovers and one that pages an engineer at 3 AM.
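The difference queueing makes can be seen in a toy simulation. It is a deliberately simplified model, not the platform's scheduler: times are plain numbers (read them as minutes), every run takes the same duration, queued runs start the moment a slot frees, and the function and parameter names are hypothetical.

```python
def simulate(trigger_times, run_duration, max_concurrent_runs=1, queueing=True):
    """Return (start, end) for each trigger, or None if the run is skipped."""
    slot_free_at = []  # times at which occupied slots become free
    results = []
    for trigger in sorted(trigger_times):
        # Drop slots that have already been released by this trigger time.
        slot_free_at = [t for t in slot_free_at if t > trigger]
        if len(slot_free_at) < max_concurrent_runs:
            start = trigger                  # a slot is free right now
        elif queueing:
            start = min(slot_free_at)        # wait for the earliest slot
            slot_free_at.remove(start)
        else:
            results.append(None)             # no slot and no queue: skipped
            continue
        end = start + run_duration
        slot_free_at.append(end)
        results.append((start, end))
    return results

# Triggers every 60 minutes, but each run takes 90 minutes.
with_queue = simulate([0, 60, 120], run_duration=90, queueing=True)
without_queue = simulate([0, 60, 120], run_duration=90, queueing=False)
print(with_queue)
print(without_queue)
```

With queueing on, all three loads eventually run back to back; with it off, the second trigger is dropped because the first run is still occupying the only slot.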
A Lakeflow Job has 12 tasks; task 7 fails after tasks 1-6 succeed. You fix the bug and want to resume without recomputing tasks 1-6. What should you use?
You want a cleanup task that sends an alert and tears down resources whenever an upstream ETL task FAILS, but is skipped when it succeeds. Which Run if condition fits?
Which statement about job clusters versus all-purpose clusters is correct?
What does the schedule '0 0 2 * * ?' represent, and what format does the UI schedule compile to?