2.1 Auto Loader (cloudFiles)

Key Takeaways

  • Auto Loader is a Structured Streaming source that incrementally and efficiently processes new data files as they arrive in cloud storage.
  • Auto Loader uses the cloudFiles format and supports JSON, CSV, Parquet, Avro, ORC, text, and binary file formats.
  • Two file detection modes exist: directory listing (default, no setup) and file notification (event-based, lower latency and cost for large directories).
  • Schema inference and evolution are built-in — Auto Loader can automatically detect and adapt to schema changes in source files.
  • The rescued data column (_rescued_data) captures any data that does not match the expected schema instead of failing the entire load.
Last updated: March 2026


Quick Answer: Auto Loader is a Structured Streaming source (format = "cloudFiles") that incrementally processes new files as they arrive in cloud storage. It supports schema inference, schema evolution, and rescued data columns. Use directory listing mode for simplicity or file notification mode for lower latency at scale.

What Is Auto Loader?

Auto Loader provides the most efficient way to incrementally ingest new data files from cloud storage (S3, ADLS, GCS) into Delta Lake tables. Given a source directory, Auto Loader automatically:

  • Detects new files as they arrive
  • Processes each file exactly once
  • Handles schema changes gracefully
  • Scales to millions of files

Basic Auto Loader Syntax

PySpark Syntax

# Read with Auto Loader
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/events/")
)

# Write to a Delta table
(df.writeStream
    .option("checkpointLocation", "/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.raw_events")
)

SQL Alternative: COPY INTO

-- COPY INTO is the SQL-based alternative to Auto Loader
COPY INTO my_catalog.my_schema.raw_events
FROM '/data/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

File Detection Modes

Auto Loader supports two modes for detecting new files:

| Mode | How It Works | Best For | Setup |
| --- | --- | --- | --- |
| Directory listing | Periodically lists all files in the source directory and identifies new ones | Small to medium directories; quick setup | Default; no extra configuration |
| File notification | Subscribes to cloud storage events (S3 events, Azure Event Grid, GCS Pub/Sub) | Large directories with millions of files; lower latency | Requires cloud event infrastructure setup |
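The core idea behind directory listing mode can be sketched in plain Python. This is a toy model for intuition only, not Databricks code: keep a set of already-seen paths, list the directory, and emit only the files that are new. (In real Auto Loader this state lives in the stream's checkpoint, not in memory.)

```python
import os

def list_new_files(directory, seen):
    """Toy model of directory-listing detection: return files not seen before.
    Real Auto Loader persists this state in the streaming checkpoint so each
    file is processed exactly once across restarts."""
    current = {
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files
```

Each call re-lists the whole directory, which is why this mode gets slow and expensive as the file count grows into the millions; file notification mode avoids the repeated listing entirely.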

Directory Listing Mode

# Directory listing is the default — no additional options needed
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "false")  # Default
    .load("/data/raw/")
)

File Notification Mode

# File notification mode — lower latency for large directories
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("/data/raw/")
)

On the Exam: Know when to recommend each mode. Directory listing is simpler but slower for very large directories (millions of files). File notification has lower latency and lower cost per detection but requires event infrastructure.

Schema Inference and Evolution

Automatic Schema Inference

Auto Loader infers the schema from the first batch of files and stores it in the schemaLocation:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")  # Infer types (default: strings)
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)
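To make the inferColumnTypes behavior concrete, here is a toy model in plain Python (the function name and type mapping are illustrative, not Databricks internals): by default every column is typed as a string, and only with inference enabled do sampled values determine the type.

```python
def infer_schema(records, infer_column_types=False):
    """Toy model of Auto Loader schema inference (not Databricks code).
    With infer_column_types=False every column is typed as string,
    mirroring the cloudFiles.inferColumnTypes default for JSON/CSV."""
    schema = {}
    for record in records:
        for col, value in record.items():
            if not infer_column_types:
                schema.setdefault(col, "string")
            elif isinstance(value, bool):  # check bool before int (bool is an int subclass)
                schema.setdefault(col, "boolean")
            elif isinstance(value, int):
                schema.setdefault(col, "bigint")
            elif isinstance(value, float):
                schema.setdefault(col, "double")
            else:
                schema.setdefault(col, "string")
    return schema
```

Note that columns only present in later records still enter the schema; in real Auto Loader the inferred result is persisted under schemaLocation so subsequent runs reuse it.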

Schema Evolution Modes

| Mode | Behavior |
| --- | --- |
| addNewColumns (default) | New columns are added to the schema automatically |
| rescue | New columns go to the _rescued_data column |
| failOnNewColumns | Stream fails if new columns are detected |
| none | New columns are silently ignored |

# Configure schema evolution behavior
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)
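The four modes differ only in how a record with an unknown column is handled. The sketch below is a simplified toy model (function name and string-typing of new columns are assumptions, not Databricks internals); in particular, real addNewColumns stops the stream, which then picks up the merged schema on restart, whereas the toy merges in place.

```python
import json

def apply_evolution_mode(schema, record, mode="addNewColumns"):
    """Toy model of cloudFiles.schemaEvolutionMode semantics (not Databricks
    code). Returns the (possibly updated) schema and the row to write."""
    known = {c: v for c, v in record.items() if c in schema}
    unknown = {c: v for c, v in record.items() if c not in schema}
    if not unknown:
        return schema, known
    if mode == "addNewColumns":
        # Simplified: the real stream fails, then restarts with the new schema.
        schema = {**schema, **{c: "string" for c in unknown}}
        return schema, {**known, **unknown}
    if mode == "rescue":
        # Unknown columns are serialized into the rescued data column.
        return schema, {**known, "_rescued_data": json.dumps(unknown)}
    if mode == "failOnNewColumns":
        raise ValueError(f"Unknown columns: {sorted(unknown)}")
    if mode == "none":
        return schema, known  # silently dropped
    raise ValueError(f"Unsupported mode: {mode}")
```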

Rescued Data Column

The rescued data column captures data that doesn't match the expected schema:

# The _rescued_data column is enabled by default
# It contains JSON strings of any data that could not be parsed
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("/data/raw/")
)

# Check rescued data
df.select("_rescued_data").filter("_rescued_data IS NOT NULL")

Data ends up in _rescued_data when:

  • A column has a different data type than expected (e.g., string where int is expected)
  • A new column appears that isn't in the schema (when mode is "rescue")
  • A field has an unexpected nested structure
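The type-mismatch case above can be made concrete with a toy model in plain Python (the function and cast table are illustrative assumptions, not Auto Loader internals): a value that cannot be cast to the expected type is nulled in its column and preserved as a JSON string in _rescued_data.

```python
import json

CASTS = {"bigint": int, "double": float, "string": str}

def ingest_row(schema, record):
    """Toy model of rescue semantics (not Databricks code): values that fail
    to cast to the expected type land in _rescued_data as a JSON string."""
    row, rescued = {}, {}
    for col, expected in schema.items():
        value = record.get(col)
        if value is None:
            row[col] = None
            continue
        try:
            row[col] = CASTS[expected](value)
        except (TypeError, ValueError):
            row[col] = None           # column is nulled, not the whole row
            rescued[col] = value      # original value is preserved
    row["_rescued_data"] = json.dumps(rescued) if rescued else None
    return row
```

The key point the model illustrates: a bad value fails only its own column, the load itself succeeds, and the original value remains queryable for later investigation.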

On the Exam: The rescued data column is a key differentiator for Auto Loader. Unlike COPY INTO which fails or drops bad records, Auto Loader captures them for later investigation.

Auto Loader vs. COPY INTO

| Feature | Auto Loader (cloudFiles) | COPY INTO |
| --- | --- | --- |
| Type | Structured Streaming source | SQL command |
| Incremental | Yes; tracks processed files automatically | Yes; skips previously loaded files (less efficient at scale) |
| Schema inference | Built-in with schemaLocation | No; schema must be defined |
| Schema evolution | Built-in (addNewColumns, rescue, etc.) | Limited (mergeSchema option) |
| Rescued data | Built-in _rescued_data column | Not available |
| File detection | Directory listing or file notification | Directory listing only |
| Scalability | Millions of files | Thousands of files |
| Best for | Continuous or incremental ingestion | Ad-hoc or one-time loads |

On the Exam: Databricks recommends Auto Loader over COPY INTO for production ingestion workloads. COPY INTO is suitable for ad-hoc ingestion or when SQL-only access is required.

Test Your Knowledge

1. Which file detection mode should be used for an Auto Loader source directory containing millions of files that need low-latency detection?

2. Where does Auto Loader store data that does not match the expected schema during ingestion?

3. A data engineer needs to ingest CSV files from cloud storage into a Delta table on a one-time basis using only SQL. Which approach is most appropriate?

4. What is the default schema evolution mode for Auto Loader?
D