2.1 Auto Loader (cloudFiles)
Key Takeaways
- Auto Loader is a Structured Streaming source that incrementally and efficiently processes new data files as they arrive in cloud storage.
- Auto Loader uses the cloudFiles format and supports JSON, CSV, Parquet, Avro, ORC, text, and binary file formats.
- Two file detection modes exist: directory listing (default, no setup) and file notification (event-based, lower latency and cost for large directories).
- Schema inference and evolution are built-in — Auto Loader can automatically detect and adapt to schema changes in source files.
- The rescued data column (_rescued_data) captures any data that does not match the expected schema instead of failing the entire load.
Auto Loader (cloudFiles)
Quick Answer: Auto Loader is a Structured Streaming source (format = "cloudFiles") that incrementally processes new files as they arrive in cloud storage. It supports schema inference, schema evolution, and rescued data columns. Use directory listing mode for simplicity or file notification mode for lower latency at scale.
What Is Auto Loader?
Auto Loader provides the most efficient way to incrementally ingest new data files from cloud storage (S3, ADLS, GCS) into Delta Lake tables. Given a source directory, Auto Loader automatically:
- Detects new files as they arrive
- Processes each file exactly once
- Handles schema changes gracefully
- Scales to millions of files
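Conceptually, the exactly-once guarantee comes from Auto Loader persisting the set of already-processed files in its checkpoint and only handing new files to the stream. A minimal pure-Python sketch of that idea (the `FileTracker` class and file names are illustrative, not Databricks APIs; the real state lives in the checkpoint location, not in memory):

```python
# Illustrative sketch of checkpoint-based, exactly-once file detection.
# FileTracker is hypothetical; Auto Loader persists this state durably
# in its checkpoint, so files survive restarts without reprocessing.

class FileTracker:
    def __init__(self):
        self.processed = set()  # stands in for the checkpoint state

    def new_files(self, listing):
        """Return only files not seen before, then mark them processed."""
        fresh = [f for f in listing if f not in self.processed]
        self.processed.update(fresh)
        return fresh

tracker = FileTracker()
first = tracker.new_files(["a.json", "b.json"])
second = tracker.new_files(["a.json", "b.json", "c.json"])
print(first)   # ['a.json', 'b.json']
print(second)  # ['c.json'] — a.json and b.json are not reprocessed
```

Listing the directory again yields only the genuinely new file, which is why re-running an Auto Loader job never double-ingests data.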
Basic Auto Loader Syntax
PySpark Syntax
```python
# Read incrementally with Auto Loader
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/events/")
)

# Write to a Delta table
(df.writeStream
    .option("checkpointLocation", "/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.raw_events")
)
```
SQL Alternative: COPY INTO
```sql
-- COPY INTO is the SQL-based alternative to Auto Loader
COPY INTO my_catalog.my_schema.raw_events
FROM '/data/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```
File Detection Modes
Auto Loader supports two modes for detecting new files:
| Mode | How It Works | Best For | Setup |
|---|---|---|---|
| Directory listing | Periodically lists all files in the source directory and identifies new ones | Small to medium directories; quick setup | Default — no extra configuration |
| File notification | Subscribes to cloud storage events (S3 events, Azure Event Grid, GCS Pub/Sub) | Large directories with millions of files; lower latency | Requires cloud event infrastructure setup |
Directory Listing Mode
```python
# Directory listing is the default — no additional options needed
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "false")  # Default
    .load("/data/raw/")
)
```
File Notification Mode
```python
# File notification mode — lower latency for large directories
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("/data/raw/")
)
```
On the Exam: Know when to recommend each mode. Directory listing is simpler but slower for very large directories (millions of files). File notification has lower latency and lower cost per detection but requires event infrastructure.
Schema Inference and Evolution
Automatic Schema Inference
Auto Loader infers the schema by sampling the first files it discovers and caches the result in the schemaLocation, so subsequent runs reuse it instead of re-inferring:
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")  # Infer types (default: strings)
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)
```
Schema Evolution Modes
| Mode | Behavior |
|---|---|
| addNewColumns (default) | New columns are added to the schema automatically |
| rescue | New columns go to the _rescued_data column |
| failOnNewColumns | Stream fails if new columns are detected |
| none | New columns are silently ignored |
```python
# Configure schema evolution behavior
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)
```
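The behavioral difference between the four modes can be sketched in plain Python. Here `evolve_schema` is a hypothetical helper that mimics what happens when one record arrives with an unexpected column; it is not an Auto Loader API:

```python
import json

def evolve_schema(schema, record, mode="addNewColumns"):
    """Simulate one record arriving under a given schema evolution mode.

    schema is a set of known column names; returns (schema, row).
    This is an illustrative model, not Databricks behavior verbatim.
    """
    new_cols = {k: v for k, v in record.items() if k not in schema}
    row = {k: record.get(k) for k in schema}
    if not new_cols:
        return schema, row
    if mode == "addNewColumns":
        schema = schema | set(new_cols)              # schema grows
        row.update(new_cols)                         # data is kept in place
    elif mode == "rescue":
        row["_rescued_data"] = json.dumps(new_cols)  # schema frozen, data rescued
    elif mode == "failOnNewColumns":
        raise ValueError(f"unexpected columns: {sorted(new_cols)}")
    # mode == "none": new columns are silently dropped
    return schema, row

base = {"id", "ts"}
_, row = evolve_schema(base, {"id": 1, "ts": "t0", "device": "ios"}, "rescue")
print(row["_rescued_data"])  # {"device": "ios"}
```

In `rescue` mode the schema never changes, but no data is lost; in `addNewColumns` mode the real stream fails once so the new column can be added, then continues after restart.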
Rescued Data Column
The rescued data column captures data that doesn't match the expected schema:
```python
# The _rescued_data column is enabled by default
# It contains JSON strings of any data that could not be parsed
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)

# Check rescued data
rescued = df.select("_rescued_data").filter("_rescued_data IS NOT NULL")
```
Data ends up in _rescued_data when:
- A column has a different data type than expected (e.g., string where int is expected)
- A new column appears that isn't in the schema (when mode is "rescue")
- A field has an unexpected nested structure
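Downstream, the rescued payload is just a JSON string, so it can be inspected with ordinary JSON tooling before deciding whether to repair or quarantine the records. A minimal sketch (the sample payload and its keys are illustrative):

```python
import json

# A _rescued_data value as it might appear after ingestion: the fields
# that failed to parse, plus source-file context (payload is made up).
rescued = '{"price": "twelve", "_file_path": "/data/raw/events/part-0001.json"}'

payload = json.loads(rescued)
bad_fields = {k: v for k, v in payload.items() if not k.startswith("_")}
print(bad_fields)  # {'price': 'twelve'} — a string where a number was expected
```

A common pattern is to route rows with non-null `_rescued_data` to a quarantine table and alert on its growth rather than failing the pipeline.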
On the Exam: The rescued data column is a key differentiator for Auto Loader. Unlike COPY INTO which fails or drops bad records, Auto Loader captures them for later investigation.
Auto Loader vs. COPY INTO
| Feature | Auto Loader (cloudFiles) | COPY INTO |
|---|---|---|
| Type | Structured Streaming source | SQL command |
| Incremental | Yes — tracks processed files in the stream checkpoint | Yes — tracks previously loaded files, but less scalable at high file counts |
| Schema inference | Built-in with schemaLocation | No — schema must be defined |
| Schema evolution | Built-in (addNewColumns, rescue, etc.) | Limited (mergeSchema option) |
| Rescued data | Built-in _rescued_data column | Not available |
| File detection | Directory listing or file notification | Directory listing only |
| Scalability | Millions of files | Thousands of files |
| Best for | Continuous or incremental ingestion | Ad-hoc or one-time loads |
On the Exam: Databricks recommends Auto Loader over COPY INTO for production ingestion workloads. COPY INTO is suitable for ad-hoc ingestion or when SQL-only access is required.
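The table's guidance can be condensed into a tiny decision helper. `recommend_ingestion` is a hypothetical function, and the 10,000-file threshold is an illustrative rule of thumb, not an official cutoff:

```python
def recommend_ingestion(continuous, file_count, sql_only=False):
    """Rule-of-thumb tool choice per the comparison table above (illustrative)."""
    if sql_only and not continuous and file_count < 10_000:
        return "COPY INTO"       # ad-hoc, SQL-only, modest file counts
    if continuous or file_count >= 10_000:
        return "Auto Loader"     # continuous or large-scale ingestion
    return "COPY INTO"           # small one-time loads

print(recommend_ingestion(continuous=True, file_count=5_000_000))        # Auto Loader
print(recommend_ingestion(continuous=False, file_count=500, sql_only=True))  # COPY INTO
```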
Practice Questions
- Which file detection mode should be used for an Auto Loader source directory containing millions of files that need low-latency detection?
- Where does Auto Loader store data that does not match the expected schema during ingestion?
- A data engineer needs to ingest CSV files from cloud storage into a Delta table on a one-time basis using only SQL. Which approach is most appropriate?
- What is the default schema evolution mode for Auto Loader?