2.1 Auto Loader (cloudFiles)
Key Takeaways
- Auto Loader uses the cloudFiles source to incrementally ingest only new files from cloud storage, tracking processed files in a checkpoint so each file is processed exactly once.
- Setting cloudFiles.schemaLocation enables schema inference and evolution; the default evolution mode addNewColumns merges new columns into the stored schema, stops the stream with an UnknownFieldException, and picks up the new columns on restart.
- When it infers the schema, Auto Loader adds a _rescued_data column that captures values that do not match the schema instead of dropping them.
- File-notification mode scales to millions of files via cloud queues, while the default directory-listing mode lists the input path each micro-batch.
- cloudFiles.format is required and names the underlying format (such as json, csv, parquet, avro, orc, text, binaryFile).
What Auto Loader Solves
Auto Loader is Databricks' purpose-built tool for incrementally and efficiently processing new data files as they land in cloud object storage (Amazon S3, Azure Data Lake Storage Gen2, Google Cloud Storage). It is exposed through the Structured Streaming source named cloudFiles. The core problem it solves is idempotent incremental ingestion: when thousands of files accumulate in a landing zone, you do not want to re-read files you have already processed, and you do not want to maintain a brittle list of filenames yourself.
Auto Loader tracks which files it has already ingested by writing their identities to a checkpoint (a RocksDB-backed key-value store under checkpointLocation). On every micro-batch it discovers only the files that have appeared since the last run, guaranteeing exactly-once ingestion even across job restarts and failures. Because it is a streaming source, the same code works for continuous streaming, scheduled incremental batches, and one-shot backfills.
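The bookkeeping described above can be pictured with a small toy model. This is only a sketch under simplifying assumptions: it stores the set of ingested file names in a JSON file rather than in RocksDB, and `discover_new_files`, the directory layout, and the file names are hypothetical, not Auto Loader internals.

```python
import json
import tempfile
from pathlib import Path


def discover_new_files(landing_dir: Path, state_file: Path) -> list[str]:
    """Return file names not seen on earlier runs and record them as seen."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    current = sorted(p.name for p in landing_dir.iterdir() if p.is_file())
    new_files = [name for name in current if name not in seen]
    # Persist the updated set so a later run (or a restart) skips these files.
    # A real engine commits this state together with the batch; the toy does not.
    state_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp) / "landing"
    landing.mkdir()
    state = Path(tmp) / "state.json"

    (landing / "a.json").write_text("{}")
    (landing / "b.json").write_text("{}")
    first_run = discover_new_files(landing, state)   # both files are new

    (landing / "c.json").write_text("{}")
    second_run = discover_new_files(landing, state)  # only the new arrival

    third_run = discover_new_files(landing, state)   # nothing new to ingest

print(first_run, second_run, third_run)
```

The point of the sketch is that the state outlives each run, so a rerun never returns a file that was already handed out.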
Basic Syntax and Required Options
In PySpark, Auto Loader is invoked through spark.readStream.format("cloudFiles"). Two options are effectively mandatory: cloudFiles.format (the underlying file format) and a schema source (cloudFiles.schemaLocation for inference, or an explicit .schema(...)).
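    # Read: incrementally discover new JSON files under /landing/events.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/chk/schema")  # inferred-schema store
          .load("/landing/events"))

    # Write: process everything currently available, then stop.
    (df.writeStream
       .option("checkpointLocation", "/chk/events")  # file-tracking and stream state
       .trigger(availableNow=True)
       .toTable("main.bronze.events"))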
In Databricks SQL, the read_files table-valued function plays the same role; when it is wrapped in STREAM inside a streaming table definition, it ingests files incrementally using Auto Loader. Note that checkpointLocation (the write-side state) and cloudFiles.schemaLocation (the inferred-schema store) are distinct settings serving different purposes.
| Option | Purpose |
|---|---|
| cloudFiles.format | Underlying format, such as json, csv, parquet, avro, orc, text, binaryFile |
| cloudFiles.schemaLocation | Directory storing the inferred/evolved schema |
| cloudFiles.schemaEvolutionMode | How to react to new columns |
| cloudFiles.useNotifications | Toggle file-notification vs directory-listing mode |
| cloudFiles.maxFilesPerTrigger | Throttle files per micro-batch |
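The other schema source mentioned in this section, an explicit .schema(...), might look like the sketch below. It assumes a Databricks runtime where `spark` is an active SparkSession; the helper name `read_with_explicit_schema` and the column names in the usage comment are hypothetical.

```python
def read_with_explicit_schema(spark, path, ddl_schema):
    """Build an Auto Loader reader that uses a fixed schema instead of inference.

    ddl_schema is a DDL-formatted string such as "id BIGINT, name STRING".
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(ddl_schema)  # no cloudFiles.schemaLocation needed on this path
        .load(path)
    )


# Usage on a cluster (not executed here):
# df = read_with_explicit_schema(spark, "/landing/events", "id BIGINT, name STRING")
```

With a fixed schema, evolution does not happen automatically, so this form suits sources whose layout is stable.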
Schema Inference, Evolution, and Rescued Data
Specifying cloudFiles.schemaLocation enables both inference and evolution. On first run, Auto Loader samples files to infer column names and types (by default it infers all columns as strings for text formats like JSON/CSV unless cloudFiles.inferColumnTypes is set to true). The cloudFiles.schemaEvolutionMode option controls behavior when a new column appears:
- addNewColumns (default when a schema is not provided): the stream merges the new column into the stored schema and stops with an UnknownFieldException; on the next restart the column is included. When the stream runs as a job configured with retries, the restart happens automatically.
- rescue: the schema never evolves and the stream never fails on new columns; unknown fields are captured in the _rescued_data column for later extraction.
- failOnNewColumns: the stream fails and does not proceed until you update the provided schema or remove the offending file.
- none: the schema does not evolve and new columns are ignored; this is the default when a schema is provided.
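The four modes can be compared with a toy model in plain Python. This is an illustration of the behavior described in the list, not Auto Loader's implementation: `process`, `UnknownFieldError`, and the list standing in for the stored schema are all hypothetical.

```python
import json


class UnknownFieldError(Exception):
    """Stand-in for Auto Loader's UnknownFieldException."""


def process(mode, stored_schema, record):
    """Apply one record under the given evolution mode.

    stored_schema is a mutable list of column names that plays the role of
    the schema kept in cloudFiles.schemaLocation.
    """
    new_cols = [c for c in record if c not in stored_schema]
    if new_cols and mode == "addNewColumns":
        stored_schema.extend(new_cols)     # persist the merged schema first
        raise UnknownFieldError(new_cols)  # then stop the stream
    if new_cols and mode == "failOnNewColumns":
        raise UnknownFieldError(new_cols)  # the schema is left unchanged
    row = {c: record.get(c) for c in stored_schema}
    if mode == "rescue":
        # Unknown fields are kept as JSON instead of being dropped.
        extra = {c: record[c] for c in new_cols}
        row["_rescued_data"] = json.dumps(extra) if extra else None
    # mode == "none": new columns are ignored
    return row


schema = ["id"]
record = {"id": 1, "country": "NZ"}

try:
    process("addNewColumns", schema, record)
except UnknownFieldError:
    # Simulated restart: the stored schema now contains the new column.
    restarted = process("addNewColumns", schema, record)

rescued = process("rescue", ["id"], record)
ignored = process("none", ["id"], record)
print(restarted, rescued, ignored)
```

The first call fails by design; only the second call, made against the already-updated schema, returns a row that includes the new column.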
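When Auto Loader infers the schema, it adds a _rescued_data column; if you supply a schema, or use the none mode, data is rescued only when the rescuedDataColumn option is set. Any value that does not conform to the schema (type mismatch, case mismatch, extra field) is rescued there as JSON rather than silently dropped, which makes it a key data-quality safeguard.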
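One way to inspect rescued values downstream is sketched below. It assumes `df` is a Spark DataFrame read from a table that carries the _rescued_data column, and that `field` is a simple top-level key; the helper name `rescued_rows` is hypothetical.

```python
def rescued_rows(df, field):
    """Keep rows that have rescued data and pull one field out of the JSON.

    Uses SQL expression strings, so no pyspark imports are needed here.
    """
    return (
        df.filter("_rescued_data IS NOT NULL")
          .selectExpr(
              "*",
              f"get_json_object(_rescued_data, '$.{field}') AS rescued_{field}",
          )
    )


# Usage on a cluster (not executed here):
# rescued_rows(spark.table("main.bronze.events"), "country").show()
```

The extracted value comes back as a string, so cast it explicitly before merging it into a typed column.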
Directory Listing vs File Notification
Auto Loader can discover new files two ways:
| Mode | How it works | Best for |
|---|---|---|
| Directory listing (default) | Lists the input directory each trigger | Smaller numbers of files; no cloud setup |
| File notification | Subscribes to a cloud queue/notification service fed by storage events | Millions of files; lower latency and cost at scale |
File-notification mode (cloudFiles.useNotifications=true) provisions a notification service and queue (e.g., SNS + SQS on AWS) so Auto Loader is told about new files instead of repeatedly listing the directory. Databricks can configure these resources automatically when given the right permissions. For most pipelines, the default directory-listing mode is sufficient and requires no extra cloud configuration.
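Switching between the two discovery modes is a single option, as in the sketch below. It assumes `spark` is an active SparkSession on Databricks; the helper name `build_reader` is hypothetical, and depending on the cloud provider and permissions, file-notification mode may need further provider-specific options that are not shown.

```python
def build_reader(spark, path, schema_location, use_notifications=False):
    """Build an Auto Loader reader in directory-listing or notification mode."""
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_location)
        # "false" keeps the default directory-listing behavior.
        .option("cloudFiles.useNotifications", str(use_notifications).lower())
        .load(path)
    )


# Usage on a cluster (not executed here):
# df = build_reader(spark, "/landing/events", "/chk/schema", use_notifications=True)
```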
Exactly-Once Guarantees and Cost Efficiency
Auto Loader's incremental nature is what makes it both correct and cheap. Because the checkpoint records every file it has ingested, re-running the stream never re-reads old files, so each record is written exactly once into the target Delta table even after a crash mid-batch. This is fundamentally different from a naive spark.read.json("/landing/*"), which would reprocess the entire directory on every run, duplicating data and ballooning compute cost.
Several options tune throughput and cost:
- cloudFiles.maxFilesPerTrigger caps how many files each micro-batch picks up, smoothing load and bounding memory.
- cloudFiles.maxBytesPerTrigger does the same by data volume.
- cloudFiles.includeExistingFiles controls whether files already present when the stream starts are ingested (default true) or only new arrivals.
- cloudFiles.allowOverwrites lets re-uploaded files with the same name be re-ingested.
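These tuning options can be collected in one place and passed to the reader together. The sketch below is one possible arrangement; the helper name `tuning_options` and the example values are hypothetical, not recommended settings.

```python
def tuning_options(max_files=None, max_bytes=None,
                   include_existing=True, allow_overwrites=False):
    """Return Auto Loader throughput options as a dict of strings."""
    opts = {
        "cloudFiles.includeExistingFiles": str(include_existing).lower(),
        "cloudFiles.allowOverwrites": str(allow_overwrites).lower(),
    }
    if max_files is not None:
        opts["cloudFiles.maxFilesPerTrigger"] = str(max_files)
    if max_bytes is not None:
        opts["cloudFiles.maxBytesPerTrigger"] = str(max_bytes)  # e.g. "10g"
    return opts


opts = tuning_options(max_files=500, max_bytes="10g")
print(opts)

# Usage on a cluster (not executed here):
# reader = spark.readStream.format("cloudFiles").options(**opts)
```

Leaving a limit as None omits the option entirely, so the engine's own default applies.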
A common production pattern is Auto Loader feeding a bronze raw-ingestion table with trigger(availableNow=True) on a schedule: the job wakes, ingests only the files that arrived since the last run, and shuts down, combining streaming's exactly-once bookkeeping with batch-job economics. Pairing this with addNewColumns evolution means the bronze table picks up new upstream fields after a restart, and _rescued_data keeps values that do not fit the schema instead of dropping them. This makes Auto Loader the recommended ingestion path over manual file listing or COPY INTO for high-volume landing zones.
For text formats such as JSON and CSV, Auto Loader infers columns as strings by default. Set cloudFiles.inferColumnTypes=true to have it attempt richer type inference, and use cloudFiles.schemaHints to override specific column types without supplying a full schema. Schema hints are the cleanest way to fix a single mis-inferred field, for example forcing a column to DECIMAL or TIMESTAMP. Because the schema lives in schemaLocation, deleting that directory resets inference; deleting checkpointLocation resets the file-tracking state and causes already-ingested files to be reprocessed.
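Type inference and schema hints could be combined as in the sketch below. It assumes `spark` is an active SparkSession on Databricks; the helper name `typed_reader` and the column names `amount` and `event_ts` in the usage comment are hypothetical.

```python
def typed_reader(spark, path, schema_location, hints):
    """Build an Auto Loader reader with type inference plus targeted hints.

    hints is a DDL-style string, e.g. "amount DECIMAL(18,2), event_ts TIMESTAMP".
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_location)
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaHints", hints)  # overrides only the named columns
        .load(path)
    )


# Usage on a cluster (not executed here):
# df = typed_reader(spark, "/landing/events", "/chk/schema",
#                   "amount DECIMAL(18,2), event_ts TIMESTAMP")
```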
Which option must be set to enable schema inference and evolution in Auto Loader?
Using the default addNewColumns evolution mode, what happens the first time a previously unseen column appears in incoming files?
What is the purpose of the _rescued_data column that Auto Loader adds?
Which discovery mode is best suited for ingesting millions of files with the lowest listing cost?