2.9 Data Extraction from External Sources

Key Takeaways

JDBC reads pull from relational databases via .format('jdbc') with url, dbtable (or query), user, and password; partitionColumn/lowerBound/upperBound/numPartitions parallelize the read.
COPY INTO is an idempotent SQL command that incrementally loads files into a Delta table, skipping already-loaded files.
Cloud storage is accessed through Unity Catalog external locations and volumes rather than legacy DBFS mounts.
Lakehouse Federation queries external databases (PostgreSQL, MySQL, Snowflake, Redshift) in place via foreign catalogs without copying data.
Secrets for source credentials should come from dbutils.secrets / secret scopes, never hard-coded in notebooks.

Last updated: June 2026

JDBC: Reading Relational Databases

To pull data from an external relational database (PostgreSQL, MySQL, SQL Server, Oracle), Spark uses the JDBC data source. You supply a connection URL, a table or query, and credentials:

df = (spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/sales")
  .option("dbtable", "public.orders")
  .option("user", user)
  .option("password", pw)
  .load())

By default a JDBC read uses a single connection, so the whole table streams through one task — a bottleneck for large tables. To parallelize, provide partitionColumn, lowerBound, upperBound, and numPartitions; Spark then issues numPartitions range queries in parallel. Use a query option (a subquery) to push filtering and joins down to the source database. Credentials must come from dbutils.secrets / a secret scope — never hard-code passwords in a notebook.

COPY INTO, Cloud Storage, and Lakehouse Federation

Several other extraction paths appear on the exam:

Mechanism	What it does
COPY INTO	Idempotently loads files into a Delta table, skipping files already ingested
Auto Loader	Streaming incremental file ingestion (see 2.1)
External locations / volumes	Govern access to cloud storage paths via Unity Catalog
Lakehouse Federation	Query foreign databases in place via a foreign catalog

COPY INTO is a retriable SQL command for bounded, incremental batch loads — it remembers which files it loaded, so re-running it does not duplicate data; choose Auto Loader for very high file volumes and COPY INTO for simpler scheduled loads. Cloud storage is reached through Unity Catalog external locations (registered storage credentials + paths) and volumes, replacing legacy DBFS mounts. Lakehouse Federation lets you register an external database as a foreign catalog and query its tables directly with Spark SQL — no data movement, governed by Unity Catalog.

Together these cover file-based, table-based, and live-query extraction.

COPY INTO vs Auto Loader: Choosing an Ingestion Path

Both COPY INTO and Auto Loader load files idempotently into Delta, so the exam tests when to pick each:

Factor	COPY INTO	Auto Loader
Interface	SQL command	Structured Streaming (`cloudFiles`)
State tracking	Per-table load history	Checkpoint + RocksDB
Scale	Thousands of files	Millions of files
Schema evolution	Limited	Full (`schemaEvolutionMode`)
Best for	Periodic, simpler loads	Continuous / very high volume

A representative COPY INTO:

COPY INTO main.bronze.events
FROM '/landing/events'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema'='true')
COPY_OPTIONS ('mergeSchema'='true');

Running this repeatedly only loads new files, because COPY INTO tracks which files it has already ingested. For a low-volume daily drop, COPY INTO is the simplest choice; once file counts climb into the millions or you need true streaming with rich schema evolution, Auto Loader is the recommendation.

Secure Access: Secrets, External Locations, and Volumes

Never embed credentials in notebook code. Databricks secret scopes store sensitive values, retrieved at runtime with dbutils.secrets.get(scope, key); the value is redacted in any output, so it cannot leak into logs. JDBC passwords, API tokens, and storage keys should all come from secrets.

For cloud storage, Unity Catalog governs access through two abstractions:

Storage credentials — an IAM role or service principal Unity Catalog uses to reach cloud storage.
External locations — a named path plus a storage credential, on which you grant READ FILES / WRITE FILES.
Volumes — governed file storage (/Volumes/cat/schema/vol/) for unstructured and semi-structured data, replacing DBFS mounts.

This governed model means a data engineer queries /Volumes/... or an external-location path and Unity Catalog enforces who may read or write, auditing every access — far safer than legacy mount points with embedded keys. For relational sources, Lakehouse Federation likewise uses a connection object (holding credentials) plus a foreign catalog to expose external tables, so the same governance and secret-management discipline applies whether the source is files, a JDBC database, or a federated warehouse. The recurring exam theme: extraction should be incremental, idempotent, and credential-safe.

Pushdown, Predicate Filtering, and Source Selection

When extracting from JDBC or federated sources, pushdown is what keeps reads fast: rather than pulling an entire table and filtering in Spark, push the work to the source. With JDBC you achieve this by passing a query option containing a WHERE clause, or by relying on Spark/Catalyst to translate a .filter() into source-side SQL. Lakehouse Federation similarly pushes filters, projections, and many aggregations down to the remote engine, so only the needed rows cross the network. The general principle: read as little as possible by letting the source do the filtering.

Choosing the right extraction mechanism is a frequent exam decision. A quick decision guide:

Scenario	Best choice
Files continuously landing, huge volume	Auto Loader (`cloudFiles`)
Periodic file drops, modest volume	`COPY INTO`
One-time pull from a relational DB	JDBC read (parallelized)
Query a live external DB without copying	Lakehouse Federation
Reach governed cloud files	Unity Catalog volumes / external locations

No matter the path, the engineering goals are consistent: ingest only new data (incrementality), tolerate retries without duplication (idempotency), keep credentials in secret scopes (security), and let the source filter where possible (pushdown). Internalizing this matrix lets you answer most 'which tool should I use to extract X?' questions on the certification quickly and correctly.

Test Your Knowledge

A JDBC read of a very large table runs slowly through a single task. Which set of options parallelizes the read?

header, inferSchema, sep

partitionColumn, lowerBound, upperBound, numPartitions

checkpointLocation, trigger, outputMode

schemaLocation, schemaEvolutionMode

Test Your Knowledge

Which feature lets you query an external PostgreSQL database directly from Databricks SQL without copying its data into the lakehouse?

Lakehouse Federation via a foreign catalog

COPY INTO

Auto Loader

dbutils.fs.cp

Test Your Knowledge

How should a database password be supplied to a JDBC read in a Databricks notebook?

Typed directly into the .option('password', ...) call as a literal string

Stored in a widget so it appears in the notebook UI

Printed to the cell output for verification

Retrieved with dbutils.secrets.get from a secret scope

Up Next

2.10 Table Management and Properties

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

2.9 Data Extraction from External Sources

Key Takeaways

JDBC: Reading Relational Databases

COPY INTO, Cloud Storage, and Lakehouse Federation

COPY INTO vs Auto Loader: Choosing an Ingestion Path

Secure Access: Secrets, External Locations, and Volumes

Pushdown, Predicate Filtering, and Source Selection

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

2.9 Data Extraction from External Sources

Key Takeaways

JDBC: Reading Relational Databases

COPY INTO, Cloud Storage, and Lakehouse Federation

COPY INTO vs Auto Loader: Choosing an Ingestion Path

Secure Access: Secrets, External Locations, and Volumes

Pushdown, Predicate Filtering, and Source Selection