2.1 Auto Loader (cloudFiles)
Key Takeaways
- Auto Loader is a Structured Streaming source that incrementally and efficiently processes new data files as they arrive in cloud storage.
- Auto Loader uses the cloudFiles format and supports JSON, CSV, Parquet, Avro, ORC, text, and binary file formats.
- Two file detection modes exist: directory listing (default, no setup) and file notification (event-based, lower latency and cost for large directories).
- Schema inference and evolution are built-in — Auto Loader can automatically detect and adapt to schema changes in source files.
- The rescued data column (_rescued_data) captures any data that does not match the expected schema instead of failing the entire load.
Auto Loader (cloudFiles)
Quick Answer: Auto Loader is a Structured Streaming source (format = "cloudFiles") that incrementally processes new files as they arrive in cloud storage. It supports schema inference, schema evolution, and rescued data columns. Use directory listing mode for simplicity or file notification mode for lower latency at scale.
What Is Auto Loader?
Auto Loader provides the most efficient way to incrementally ingest new data files from cloud storage (S3, ADLS, GCS) into Delta Lake tables. Given a source directory, Auto Loader automatically:
- Detects new files as they arrive
- Processes each file exactly once
- Handles schema changes gracefully
- Scales to millions of files
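Conceptually, the exactly-once guarantee comes from Auto Loader persisting the set of already-processed files in its checkpoint and only handing new files to the stream. A minimal pure-Python sketch of that idea (the `FileTracker` class and file names are illustrative, not Databricks APIs; the real state lives in the checkpoint location, not in memory):

```python
# Illustrative sketch of checkpoint-based, exactly-once file detection.
# FileTracker is hypothetical; Auto Loader persists this state durably
# in its checkpoint, so files survive restarts without reprocessing.

class FileTracker:
    def __init__(self):
        self.processed = set()  # stands in for the checkpoint state

    def new_files(self, listing):
        """Return only files not seen before, then mark them processed."""
        fresh = [f for f in listing if f not in self.processed]
        self.processed.update(fresh)
        return fresh

tracker = FileTracker()
first = tracker.new_files(["a.json", "b.json"])
second = tracker.new_files(["a.json", "b.json", "c.json"])
print(first)   # ['a.json', 'b.json']
print(second)  # ['c.json'] — a.json and b.json are not reprocessed
```

Listing the directory again yields only the genuinely new file, which is why re-running an Auto Loader job never double-ingests data.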
Basic Auto Loader Syntax
PySpark Syntax
```python
# Read incrementally with Auto Loader
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/events/")
)

# Write to a Delta table
(df.writeStream
    .option("checkpointLocation", "/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.raw_events")
)
```
SQL Alternative: COPY INTO
```sql
-- COPY INTO is the SQL-based alternative to Auto Loader
COPY INTO my_catalog.my_schema.raw_events
FROM '/data/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```
File Detection Modes
Auto Loader supports two modes for detecting new files:
| Mode | How It Works | Best For | Setup |
|---|---|---|---|
| Directory listing | Periodically lists all files in the source directory and identifies new ones | Small to medium directories; quick setup | Default — no extra configuration |
| File notification | Subscribes to cloud storage events (S3 events, Azure Event Grid, GCS Pub/Sub) | Large directories with millions of files; lower latency | Requires cloud event infrastructure setup |
Directory Listing Mode
```python
# Directory listing is the default — no additional options needed
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "false")  # Default
    .load("/data/raw/")
)
```
File Notification Mode
```python
# File notification mode — lower latency for large directories
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("/data/raw/")
)
```
On the Exam: Know when to recommend each mode. Directory listing is simpler but slower for very large directories (millions of files). File notification has lower latency and lower cost per detection but requires event infrastructure.
Schema Inference and Evolution
Automatic Schema Inference
Auto Loader infers the schema by sampling the first files it discovers and caches the result in the schemaLocation, so subsequent runs reuse it instead of re-inferring:
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")  # Infer types (default: strings)
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)
```
Schema Evolution Modes
| Mode | Behavior |
|---|---|
| addNewColumns (default) | New columns are added to the schema automatically |
| rescue | New columns go to the _rescued_data column |
| failOnNewColumns | Stream fails if new columns are detected |
| none | New columns are silently ignored |
```python
# Configure schema evolution behavior
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)
```
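The behavioral difference between the four modes can be sketched in plain Python. Here `evolve_schema` is a hypothetical helper that mimics what happens when one record arrives with an unexpected column; it is not an Auto Loader API:

```python
import json

def evolve_schema(schema, record, mode="addNewColumns"):
    """Simulate one record arriving under a given schema evolution mode.

    schema is a set of known column names; returns (schema, row).
    This is an illustrative model, not Databricks behavior verbatim.
    """
    new_cols = {k: v for k, v in record.items() if k not in schema}
    row = {k: record.get(k) for k in schema}
    if not new_cols:
        return schema, row
    if mode == "addNewColumns":
        schema = schema | set(new_cols)              # schema grows
        row.update(new_cols)                         # data is kept in place
    elif mode == "rescue":
        row["_rescued_data"] = json.dumps(new_cols)  # schema frozen, data rescued
    elif mode == "failOnNewColumns":
        raise ValueError(f"unexpected columns: {sorted(new_cols)}")
    # mode == "none": new columns are silently dropped
    return schema, row

base = {"id", "ts"}
_, row = evolve_schema(base, {"id": 1, "ts": "t0", "device": "ios"}, "rescue")
print(row["_rescued_data"])  # {"device": "ios"}
```

In `rescue` mode the schema never changes, but no data is lost; in `addNewColumns` mode the real stream fails once so the new column can be added, then continues after restart.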
Rescued Data Column
The rescued data column captures data that doesn't match the expected schema:
```python
# The _rescued_data column is enabled by default
# It contains JSON strings of any data that could not be parsed
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)

# Check rescued data
rescued = df.select("_rescued_data").filter("_rescued_data IS NOT NULL")
```
Data ends up in _rescued_data when:
- A column has a different data type than expected (e.g., string where int is expected)
- A new column appears that isn't in the schema (when mode is "rescue")
- A field has an unexpected nested structure
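Downstream, the rescued payload is just a JSON string, so it can be inspected with ordinary JSON tooling before deciding whether to repair or quarantine the records. A minimal sketch (the sample payload and its keys are illustrative):

```python
import json

# A _rescued_data value as it might appear after ingestion: the fields
# that failed to parse, plus source-file context (payload is made up).
rescued = '{"price": "twelve", "_file_path": "/data/raw/events/part-0001.json"}'

payload = json.loads(rescued)
bad_fields = {k: v for k, v in payload.items() if not k.startswith("_")}
print(bad_fields)  # {'price': 'twelve'} — a string where a number was expected
```

A common pattern is to route rows with non-null `_rescued_data` to a quarantine table and alert on its growth rather than failing the pipeline.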
On the Exam: The rescued data column is a key differentiator for Auto Loader. Unlike COPY INTO which fails or drops bad records, Auto Loader captures them for later investigation.
Auto Loader vs. COPY INTO
| Feature | Auto Loader (cloudFiles) | COPY INTO |
|---|---|---|
| Type | Structured Streaming source | SQL command |
| Incremental | Yes — tracks processed files in the stream checkpoint | Yes — tracks previously loaded files, but less scalable at high file counts |
| Schema inference | Built-in with schemaLocation | No — schema must be defined |
| Schema evolution | Built-in (addNewColumns, rescue, etc.) | Limited (mergeSchema option) |
| Rescued data | Built-in _rescued_data column | Not available |
| File detection | Directory listing or file notification | Directory listing only |
| Scalability | Millions of files | Thousands of files |
| Best for | Continuous or incremental ingestion | Ad-hoc or one-time loads |
On the Exam: Databricks recommends Auto Loader over COPY INTO for production ingestion workloads. COPY INTO is suitable for ad-hoc ingestion or when SQL-only access is required.
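The table's guidance can be condensed into a tiny decision helper. `recommend_ingestion` is a hypothetical function, and the 10,000-file threshold is an illustrative rule of thumb, not an official cutoff:

```python
def recommend_ingestion(continuous, file_count, sql_only=False):
    """Rule-of-thumb tool choice per the comparison table above (illustrative)."""
    if sql_only and not continuous and file_count < 10_000:
        return "COPY INTO"       # ad-hoc, SQL-only, modest file counts
    if continuous or file_count >= 10_000:
        return "Auto Loader"     # continuous or large-scale ingestion
    return "COPY INTO"           # small one-time loads

print(recommend_ingestion(continuous=True, file_count=5_000_000))        # Auto Loader
print(recommend_ingestion(continuous=False, file_count=500, sql_only=True))  # COPY INTO
```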
Practice Questions
- Which file detection mode should be used for an Auto Loader source directory containing millions of files that need low-latency detection?
- Where does Auto Loader store data that does not match the expected schema during ingestion?
- A data engineer needs to ingest CSV files from cloud storage into a Delta table on a one-time basis using only SQL. Which approach is most appropriate?
- What is the default schema evolution mode for Auto Loader?