2.1 Auto Loader (cloudFiles)

Key Takeaways

Auto Loader uses the cloudFiles source to incrementally ingest only new files from cloud storage, tracking processed files in a checkpoint so each file is processed exactly once.
Setting cloudFiles.schemaLocation enables schema inference and evolution; the default evolution mode addNewColumns stops the stream with UnknownFieldException and merges new columns on restart.
When it infers the schema, Auto Loader adds a _rescued_data column that captures values that do not match the inferred schema instead of dropping them.
File-notification mode scales to millions of files via cloud queues, while the default directory-listing mode lists the input path on each micro-batch.
cloudFiles.format is required and names the underlying format (json, csv, parquet, avro, text, binaryFile).

Last updated: June 2026

What Auto Loader Solves

Auto Loader is Databricks' purpose-built tool for incrementally and efficiently processing new data files as they land in cloud object storage (Amazon S3, Azure Data Lake Storage Gen2, Google Cloud Storage). It is exposed through the Structured Streaming source named cloudFiles. The core problem it solves is idempotent incremental ingestion: when thousands of files accumulate in a landing zone, you do not want to re-read files you have already processed, and you do not want to maintain a brittle list of filenames yourself.

Auto Loader tracks which files it has already ingested by writing their identities to a checkpoint (a RocksDB-backed key-value store under checkpointLocation). On every micro-batch it discovers only the files that have appeared since the last run, which gives exactly-once ingestion even across job restarts and failures. Because it is a streaming source, the same code works for continuous streaming, scheduled incremental batches, and one-shot backfills.
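The bookkeeping described above can be pictured with a toy model. The sketch below is plain Python, not Auto Loader's implementation: the `seen` set stands in for the checkpoint state, and `discover_new_files` is a hypothetical helper name used only for illustration.

```python
def discover_new_files(listing, seen):
    """Return files not yet recorded in `seen`, then record them.

    `listing` is the collection of paths currently in the landing zone;
    `seen` is a mutable set standing in for the checkpoint state.
    """
    new_files = sorted(set(listing) - seen)
    seen.update(new_files)
    return new_files


checkpoint = set()

# First micro-batch: everything in the landing zone is new.
batch_1 = discover_new_files(["a.json", "b.json"], checkpoint)

# Second micro-batch: only the file that arrived since the last run is picked up.
batch_2 = discover_new_files(["a.json", "b.json", "c.json"], checkpoint)

# A "restart" that reuses the same checkpoint finds nothing to reprocess.
batch_3 = discover_new_files(["a.json", "b.json", "c.json"], checkpoint)

print(batch_1, batch_2, batch_3)
```

Starting again with an empty `checkpoint` would return all three files, which mirrors why losing the real checkpoint causes reprocessing.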

Basic Syntax and Required Options

In PySpark, Auto Loader is invoked through spark.readStream.format("cloudFiles"). Two options are effectively mandatory: cloudFiles.format (the underlying file format) and a schema source (cloudFiles.schemaLocation for inference, or an explicit .schema(...)).

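# Read side: discover new JSON files incrementally and infer their schema.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/chk/schema")  # inferred-schema store
  .load("/landing/events"))

# Write side: process everything currently available, record progress, then stop.
(df.writeStream
  .option("checkpointLocation", "/chk/events")  # file-tracking state
  .trigger(availableNow=True)
  .toTable("main.bronze.events"))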

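In SQL, the closest equivalent is the read_files(...) table-valued function, which uses Auto Loader when it is read as a stream (for example, STREAM read_files(...) in a streaming table). Note that checkpointLocation (the write-side state, including which files have been processed) and cloudFiles.schemaLocation (the inferred-schema store) are distinct settings serving different purposes.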

Option	Purpose
`cloudFiles.format`	Underlying format: json, csv, parquet, avro, text, binaryFile
`cloudFiles.schemaLocation`	Directory storing the inferred/evolved schema
`cloudFiles.schemaEvolutionMode`	How to react to new columns
`cloudFiles.useNotifications`	Toggle file-notification vs directory-listing mode
`cloudFiles.maxFilesPerTrigger`	Throttle files per micro-batch
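One way to keep these options consistent across pipelines is to build them as a dictionary and pass them to the reader in one call. The sketch below is a minimal illustration: `auto_loader_options` and `read_landing` are hypothetical helper names, and `read_landing` assumes a Databricks `spark` session is passed in, so only the option-building part runs outside Databricks.

```python
def auto_loader_options(file_format, schema_location,
                        use_notifications=False, max_files_per_trigger=None):
    """Build a dict of cloudFiles options, with every value rendered as a string."""
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }
    if max_files_per_trigger is not None:
        options["cloudFiles.maxFilesPerTrigger"] = str(max_files_per_trigger)
    return options


def read_landing(spark, path, options):
    """Apply the options to a cloudFiles stream (needs a Databricks `spark` session)."""
    return spark.readStream.format("cloudFiles").options(**options).load(path)


opts = auto_loader_options("json", "/chk/schema", max_files_per_trigger=500)
print(opts)
```

In a notebook, `read_landing(spark, "/landing/events", opts)` would then return the same kind of streaming DataFrame as the chained `.option(...)` calls shown earlier.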

Schema Inference, Evolution, and Rescued Data

Specifying cloudFiles.schemaLocation enables both inference and evolution. On first run, Auto Loader samples files to infer column names and types (by default it infers all columns as strings for text formats like JSON/CSV unless cloudFiles.inferColumnTypes is set to true). The cloudFiles.schemaEvolutionMode option controls behavior when a new column appears:

addNewColumns (default when a schema is not provided): the stream stops with an UnknownFieldException after merging the new column into the stored schema, so the column is included on the next restart. A job configured with retries restarts the stream automatically.
rescue: the schema never evolves and the stream never fails on new columns; unknown fields are captured in the _rescued_data column for later extraction.
failOnNewColumns: the stream fails and does not proceed until you manually update the schema or remove the offending file.
none: the schema does not evolve and the stream does not fail; new columns are ignored, and they are not rescued unless the rescuedDataColumn option is set.

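When Auto Loader infers the schema, it adds a _rescued_data column automatically; with an explicit schema, or in none mode, data is rescued only if the rescuedDataColumn option is set. Any value that does not conform to the schema (type mismatch, case mismatch, extra field) is rescued there as JSON rather than silently dropped, which makes it a key data-quality safeguard.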
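The four modes can be compared side by side with a small simulation. This is a simplified model in plain Python, not Auto Loader code: `ingest` and `StreamStopped` are hypothetical names, the schema is reduced to a list of column names, and none mode is assumed to drop unknown fields without rescuing them.

```python
import json


class StreamStopped(Exception):
    """Stand-in for the stream failing on a schema change."""


def ingest(record, schema, mode="addNewColumns"):
    """Apply one schema evolution mode to a single record.

    `schema` is a mutable list of known column names (the stored schema).
    Returns the output row, or raises StreamStopped to model the stream stopping.
    """
    unknown = {k: v for k, v in record.items() if k not in schema}
    row = {col: record.get(col) for col in schema}

    if not unknown:
        return row
    if mode == "addNewColumns":
        schema.extend(unknown)  # stored schema is updated, then the stream stops
        raise StreamStopped(f"new columns: {sorted(unknown)}")
    if mode == "failOnNewColumns":
        raise StreamStopped(f"new columns: {sorted(unknown)}")  # schema untouched
    if mode == "rescue":
        row["_rescued_data"] = json.dumps(unknown)  # kept as JSON, schema untouched
        return row
    if mode == "none":
        return row  # new columns ignored
    raise ValueError(f"unknown mode: {mode}")


schema = ["id"]
record = {"id": 1, "color": "red"}

rescued = ingest(record, list(schema), mode="rescue")
ignored = ingest(record, list(schema), mode="none")

try:
    ingest(record, schema, mode="addNewColumns")  # first attempt stops the stream
except StreamStopped:
    pass
after_restart = ingest(record, schema, mode="addNewColumns")  # retry succeeds

print(rescued, ignored, after_restart)
```

The retry succeeds only because the first attempt already updated the stored schema, which is the behavior that lets a restarted job continue without manual intervention.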

Directory Listing vs File Notification

Auto Loader can discover new files two ways:

Mode	How it works	Best for
Directory listing (default)	Lists the input directory each trigger	Smaller numbers of files; no cloud setup
File notification	Subscribes to a cloud queue/notification service fed by storage events	Millions of files; lower latency and cost at scale

File-notification mode (cloudFiles.useNotifications=true) provisions a notification service and queue (e.g., SNS + SQS on AWS) so Auto Loader is told about new files instead of repeatedly listing the directory. Databricks can configure these resources automatically when given the right permissions. For most pipelines, the default directory-listing mode is sufficient and requires no extra cloud configuration.
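A back-of-envelope model shows why the two modes diverge at scale. The sketch below is an assumption-laden simplification: it treats directory listing as examining every file in the input path on each trigger and treats notification mode as consuming one queue event per new file. The function names are hypothetical, and real listing behavior may be cheaper than this worst case.

```python
def files_examined_by_listing(existing_files, arrivals_per_trigger, triggers):
    """Each trigger lists the whole directory, including files already ingested."""
    total = 0
    in_directory = existing_files
    for _ in range(triggers):
        in_directory += arrivals_per_trigger
        total += in_directory
    return total


def events_consumed_by_notification(arrivals_per_trigger, triggers):
    """Each new file produces one queue event, regardless of directory size."""
    return arrivals_per_trigger * triggers


# One million files already landed, 100 new files before each of 10 triggers.
listing_work = files_examined_by_listing(1_000_000, 100, 10)
notification_work = events_consumed_by_notification(100, 10)

print(listing_work, notification_work)
```

Under these assumptions the listing work is dominated by files that were already ingested, while the notification work depends only on new arrivals.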

Exactly-Once Guarantees and Cost Efficiency

Auto Loader's incremental nature is what makes it both correct and cheap. Because the checkpoint records every file it has ingested, re-running the stream never re-reads old files, so each record is written exactly once into the target Delta table even after a crash mid-batch. This is fundamentally different from a naive spark.read.json("/landing/*"), which would reprocess the entire directory on every run, duplicating data and ballooning compute cost.

Several options tune throughput and cost:

cloudFiles.maxFilesPerTrigger caps how many files each micro-batch picks up, smoothing load and bounding memory.
cloudFiles.maxBytesPerTrigger does the same by data volume.
cloudFiles.includeExistingFiles controls whether files already present when the stream starts are ingested (default true) or only new arrivals are.
cloudFiles.allowOverwrites lets re-uploaded files with the same name be re-ingested.
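The effect of the two throttling options can be sketched as a batch planner. This is a simplified model rather than Auto Loader's scheduler: `plan_micro_batches` is a hypothetical function, the file names and sizes are made up, and the byte cap is treated as a limit that a single oversized file may exceed so that ingestion always makes progress.

```python
def plan_micro_batches(files, max_files=None, max_bytes=None):
    """Split a backlog of (name, size_in_bytes) pairs into micro-batches.

    A batch closes when it already holds `max_files` files or when adding
    the next file would push it past `max_bytes`, whichever comes first.
    Every batch takes at least one file.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        hit_files = max_files is not None and len(current) >= max_files
        hit_bytes = max_bytes is not None and current_bytes + size > max_bytes
        if current and (hit_files or hit_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


backlog = [("f1", 40), ("f2", 40), ("f3", 40), ("f4", 200), ("f5", 10)]

by_files = plan_micro_batches(backlog, max_files=2)
by_bytes = plan_micro_batches(backlog, max_bytes=100)

print(by_files)
print(by_bytes)
```

With neither cap set, the whole backlog lands in a single batch, which is the load spike the throttles are meant to avoid.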

A common production pattern is Auto Loader feeding a bronze raw-ingestion table with trigger(availableNow=True) on a schedule: the job wakes, ingests only the files that arrived since the last run, and shuts down, combining streaming's exactly-once bookkeeping with batch-job economics. Pairing this with addNewColumns evolution and job retries means the bronze table picks up new fields when upstream producers add them, and _rescued_data preserves values that do not fit the schema instead of dropping them. This makes Auto Loader the recommended ingestion path over manual file listing or COPY INTO for high-volume landing zones.
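Because addNewColumns stops the stream when a new column appears, the scheduled job needs something to start it again. Job-level retries normally cover this; the sketch below shows the same idea as an explicit loop. `run_with_restarts` and `looks_like_schema_change` are hypothetical helpers, and matching on the text "UnknownFieldException" in the error message is an assumption about how the failure surfaces in Python, not documented behavior.

```python
def looks_like_schema_change(exc):
    """Assumption: the schema-evolution stop mentions UnknownFieldException."""
    return "UnknownFieldException" in str(exc)


def run_with_restarts(start_stream, is_schema_change=looks_like_schema_change,
                      max_restarts=3):
    """Call `start_stream()` until it finishes, restarting after schema-change stops.

    Returns the number of restarts performed. Any other error, or running
    out of restarts, is re-raised to the caller.
    """
    restarts = 0
    while True:
        try:
            start_stream()
            return restarts
        except Exception as exc:
            if not is_schema_change(exc) or restarts >= max_restarts:
                raise
            restarts += 1


# Demo with a fake stream that fails once on a new column, then succeeds.
attempts = []

def fake_stream():
    attempts.append("run")
    if len(attempts) == 1:
        raise RuntimeError("UnknownFieldException: new column detected")

restart_count = run_with_restarts(fake_stream)
print(restart_count, len(attempts))
```

On Databricks, `start_stream` would be a function that starts the availableNow write and waits for it to terminate; capping `max_restarts` keeps a genuinely broken feed from looping forever.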

Two further options refine inference. By default Auto Loader infers JSON and CSV columns as strings; set cloudFiles.inferColumnTypes=true to have it attempt richer type inference instead. Use cloudFiles.schemaHints to override specific column types without supplying a full schema, which is the cleanest way to fix a single mis-inferred field (for example, forcing a column to DECIMAL or TIMESTAMP). The two state directories also reset independently: because the schema lives in schemaLocation, deleting that directory resets inference, while deleting the checkpointLocation resets the file-tracking state and causes already-ingested files to be reprocessed.
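Schema hints are passed as a single comma-separated string of column names and SQL types. The sketch below builds that string from a dictionary; `schema_hints` is a hypothetical helper, and the column names `amount` and `event_ts` are made up for illustration.

```python
def schema_hints(overrides):
    """Render {"column": "TYPE"} as a comma-separated 'column TYPE' string."""
    return ", ".join(f"{column} {sql_type}" for column, sql_type in overrides.items())


hints = schema_hints({"amount": "DECIMAL(18,2)", "event_ts": "TIMESTAMP"})

# Options for a reader that infers types but pins two columns explicitly.
options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "/chk/schema",
    "cloudFiles.inferColumnTypes": "true",
    "cloudFiles.schemaHints": hints,
}

print(options["cloudFiles.schemaHints"])
```

Columns not named in the hints keep whatever type inference assigns them, so the hint list can stay limited to the fields that were actually mis-inferred.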

Test Your Knowledge

Which option must be set to enable schema inference and evolution in Auto Loader?

cloudFiles.useNotifications

checkpointLocation

cloudFiles.maxFilesPerTrigger

cloudFiles.schemaLocation

Test Your Knowledge

Using the default addNewColumns evolution mode, what happens the first time a previously unseen column appears in incoming files?

The new column is silently dropped from the output

The entire checkpoint is deleted and ingestion restarts from scratch

The stream stops with an UnknownFieldException, then resumes with the column added after restart

Auto Loader switches automatically to rescue mode permanently

Test Your Knowledge

What is the purpose of the _rescued_data column that Auto Loader adds?

It stores the file path and modification time of every ingested file

It buffers records that failed a Delta constraint check

It captures values that do not match the inferred schema instead of dropping them

It holds the checkpoint offsets for the stream

Test Your Knowledge

Which discovery mode is best suited for ingesting millions of files with the lowest listing cost?

File notification mode using a cloud queue

Directory listing mode

Trigger.Once batch mode

Static spark.read with a wildcard path
