4.1 Databricks Workflows and Lakeflow Jobs
Key Takeaways
- Lakeflow Jobs (Databricks Workflows) orchestrate multi-task data pipelines using a DAG (Directed Acyclic Graph) of tasks with dependencies.
- Task types include notebooks, SQL queries, Python scripts, Lakeflow Declarative Pipelines, dbt, JAR files, and "Run Job" for modular orchestration.
- Jobs can be triggered on a schedule (cron), manually, via API, or by file arrival events.
- Job clusters are ephemeral compute resources created for a job run and terminated afterward, reducing cost compared to all-purpose clusters.
- Task parameters, conditional tasks (if/else), and for-each loops enable dynamic and flexible pipeline orchestration.
Databricks Workflows and Lakeflow Jobs
Quick Answer: Lakeflow Jobs orchestrate multi-task pipelines as DAGs. Tasks can be notebooks, SQL, Python, DLT pipelines, or other jobs. Triggers include scheduled (cron), file arrival, and API. Use job clusters for cost-efficient automated workloads.
Core Concepts
| Concept | Definition |
|---|---|
| Job | A named unit of orchestration containing one or more tasks |
| Task | A single unit of work (notebook, SQL query, DLT pipeline, etc.) |
| Run | A specific execution instance of a job |
| Trigger | The mechanism that starts a job run (schedule, event, API) |
| DAG | The directed acyclic graph of task dependencies |
Task Types
| Task Type | Description | Use Case |
|---|---|---|
| Notebook | Run a Databricks notebook | Custom ETL logic |
| SQL | Run a SQL query or dashboard | Data transformations, quality checks |
| Python script | Run a Python file | Utility scripts, API calls |
| Lakeflow Pipeline | Trigger a Declarative Pipeline update | DLT pipeline execution |
| dbt | Run dbt transformations | dbt Core integration |
| JAR | Run a Java/Scala JAR | Legacy Spark applications |
| Run Job | Trigger another job as a task | Modular, reusable workflows |
| If/Else | Conditional branching | Dynamic pipeline logic |
| For Each | Loop over a list of values | Parameterized batch processing |
Task Dependencies
Job: Daily ETL Pipeline
├── Task 1: Ingest raw data (notebook)
├── Task 2: Run DLT pipeline (depends on Task 1)
└── Task 3: Run quality checks (depends on Task 2)
    ├── Task 4a: Update dashboards (depends on Task 3, if checks pass)
    └── Task 4b: Send alert (depends on Task 3, if checks fail)
Dependencies are defined between tasks:
- A task runs only after all its dependencies complete successfully
- Tasks without dependencies on each other run in parallel
- Failed tasks can be configured to retry, skip downstream, or fail the job
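The DAG above can be sketched as a Jobs API 2.1-style job definition, where each task declares its upstream tasks via `depends_on`. This is a minimal sketch: the task keys, notebook paths, and pipeline ID placeholder are hypothetical, not taken from a real workspace.

```python
# Hedged sketch of a Jobs API 2.1-style job definition for the DAG above.
# Task keys, notebook paths, and the pipeline ID are hypothetical.
job_definition = {
    "name": "daily-etl-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Pipelines/ingest_raw"},
        },
        {
            "task_key": "dlt_update",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        },
        {
            "task_key": "quality_checks",
            "depends_on": [{"task_key": "dlt_update"}],
            "notebook_task": {"notebook_path": "/Pipelines/quality_checks"},
        },
    ],
}

# Tasks with no dependency edge between them are eligible to run in parallel;
# here only "ingest" has no upstream dependencies.
root_tasks = [t["task_key"] for t in job_definition["tasks"] if not t.get("depends_on")]
print(root_tasks)  # ['ingest']
```

Branching tasks such as 4a/4b would additionally set a "Run if" condition on the dependency, so only one branch executes depending on the outcome of the quality checks.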
Triggers and Scheduling
Scheduled Triggers (Cron)
Databricks job schedules use Quartz cron syntax, which has six fields and begins with a seconds field.
# Run every day at 2 AM UTC
Schedule: 0 0 2 * * ?
# Run every Monday at 6 AM UTC
Schedule: 0 0 6 ? * MON
# Run every hour
Schedule: 0 0 * * * ?
File Arrival Trigger
# Trigger job when new files arrive in a cloud storage location
Trigger type: File arrival
Storage location: s3://my-bucket/data/incoming/
Wait time: 60 seconds (how long to wait after last file arrival)
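In a Jobs API 2.1 payload, a file arrival trigger lives under the job's `trigger` field. The sketch below mirrors the example above (same bucket path and 60-second wait); treat the exact field names as assumptions to verify against the current API reference.

```python
# Hedged sketch of a file arrival trigger in a Jobs API 2.1-style job
# settings payload. The storage URL is the example location from the text.
trigger_settings = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            "url": "s3://my-bucket/data/incoming/",
            # Fire only after the location has been quiet for 60 seconds,
            # so a batch of files produces a single run.
            "wait_after_last_change_seconds": 60,
        },
    }
}
print(trigger_settings["trigger"]["file_arrival"]["url"])
```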
Continuous Trigger
# Immediately start a new run when the previous one completes
Trigger type: Continuous
Manual and API Triggers
- UI: Click "Run Now" in the Workflows interface
- CLI: databricks jobs run-now --job-id 123
- REST API: POST /api/2.1/jobs/run-now
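A run-now request carries the job ID and, optionally, parameter overrides for that run. A minimal sketch of the request body, assuming the job ID 123 from the CLI example and reusing the `start_date` parameter shown later in this section (the workspace host is a placeholder):

```python
import json

# Hedged sketch of a REST run-now request body. The workspace host is a
# placeholder; job_id 123 matches the CLI example above.
host = "https://<workspace-host>"
endpoint = f"{host}/api/2.1/jobs/run-now"

payload = {
    "job_id": 123,
    # Optional: override job parameters for this specific run
    "job_parameters": {"start_date": "2024-01-01"},
}
body = json.dumps(payload)
print(endpoint)
```

An actual call would POST this body with a bearer token in the `Authorization` header; the Databricks CLI and SDK wrap the same endpoint.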
Compute Configuration
Job Clusters (Recommended for Production)
- Created when the job starts, terminated when it finishes
- No cost when the job is not running
- Can be shared across tasks within the same job
- Configurable: node types, autoscaling, Spark version, libraries
All-Purpose Clusters
- Persistent clusters for interactive development
- More expensive for production jobs (paying for idle time)
- Use only when multiple jobs need to share state or libraries
Serverless Compute
- Databricks manages the infrastructure
- Fastest startup time (no cluster provisioning delay)
- Best for SQL-heavy and lightweight Python workloads
Task Parameters
# Access task parameters in a notebook
dbutils.widgets.get("start_date") # Widget-based parameter
dbutils.jobs.taskValues.get("previous_task", "row_count") # Task value from previous task
Passing Values Between Tasks
# Task 1: Set a task value
row_count = df.count()
dbutils.jobs.taskValues.set(key="row_count", value=row_count)
# Task 2: Read the task value from Task 1
count = dbutils.jobs.taskValues.get(taskKey="ingest_task", key="row_count")
if count == 0:
    dbutils.notebook.exit("No new data to process")
Error Handling and Retries
| Setting | Description |
|---|---|
| Max retries | Number of times to retry a failed task (0 = no retry) |
| Min retry interval | Minimum wait time between retries |
| Timeout | Maximum time a task can run before being killed |
| On failure | Whether downstream tasks still run, controlled by each dependent task's "Run if" condition (e.g., all succeeded, all done, at least one failed) |
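The settings in the table map onto per-task fields in a Jobs API 2.1 payload. A hedged sketch with illustrative values (the task key is hypothetical; verify field names against the current API reference):

```python
# Hedged sketch of retry/timeout settings on a single task in a
# Jobs API 2.1-style payload. The task key is hypothetical.
task_settings = {
    "task_key": "ingest",
    "max_retries": 3,                     # retry a failed task up to 3 times
    "min_retry_interval_millis": 60_000,  # wait at least 1 minute between retries
    "timeout_seconds": 3600,              # kill the task if it runs over 1 hour
    "retry_on_timeout": True,             # treat a timeout as a retryable failure
}
print(task_settings["max_retries"])
```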
Notification Alerts
Jobs can send notifications on:
- Job start — when a run begins
- Job success — when all tasks complete successfully
- Job failure — when any task fails (after retries)
- Duration threshold — when a run exceeds expected time
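These notification events map onto job-level settings in a Jobs API 2.1-style payload; the duration threshold is expressed as a health rule. A hedged sketch (the email address and 3600-second threshold are placeholders):

```python
# Hedged sketch of job-level notification settings. The recipient address
# and the duration threshold are illustrative placeholders.
notifications = {
    "email_notifications": {
        "on_start": [],
        "on_success": ["data-team@example.com"],
        "on_failure": ["data-team@example.com"],   # sent after retries are exhausted
        "on_duration_warning_threshold_exceeded": ["data-team@example.com"],
    },
    # The duration threshold that drives the warning notification above
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}
        ]
    },
}
print(sorted(notifications["email_notifications"]))
```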
On the Exam: Know the different task types, how to configure job clusters vs. all-purpose clusters, how task values pass data between tasks, and common trigger types (schedule, file arrival, continuous).
Which compute option is most cost-effective for running automated production jobs in Databricks?
How can a downstream task in a Databricks job access a value computed by an upstream task?
A data engineer wants a job to run automatically whenever new files arrive in a cloud storage bucket. Which trigger type should they configure?
What is the purpose of the "Run Job" task type in Databricks Workflows?