2.9 Data Extraction from External Sources

Key Takeaways

  • JDBC reads pull from relational databases via .format('jdbc') with url, dbtable (or query), user, and password; partitionColumn/lowerBound/upperBound/numPartitions parallelize the read.
  • COPY INTO is an idempotent SQL command that incrementally loads files into a Delta table, skipping already-loaded files.
  • Cloud storage is accessed through Unity Catalog external locations and volumes rather than legacy DBFS mounts.
  • Lakehouse Federation queries external databases (PostgreSQL, MySQL, Snowflake, Redshift) in place via foreign catalogs without copying data.
  • Secrets for source credentials should come from dbutils.secrets / secret scopes, never hard-coded in notebooks.
Last updated: June 2026

JDBC: Reading Relational Databases

To pull data from an external relational database (PostgreSQL, MySQL, SQL Server, Oracle), Spark uses the JDBC data source. You supply a connection URL, a table or query, and credentials:

df = (spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/sales")
  .option("dbtable", "public.orders")
  .option("user", user)
  .option("password", pw)
  .load())

By default a JDBC read uses a single connection, so the whole table streams through one task — a bottleneck for large tables. To parallelize, provide partitionColumn, lowerBound, upperBound, and numPartitions; Spark then issues numPartitions range queries in parallel. Use a query option (a subquery) to push filtering and joins down to the source database. Credentials must come from dbutils.secrets / a secret scope — never hard-code passwords in a notebook.

COPY INTO, Cloud Storage, and Lakehouse Federation

Several other extraction paths appear on the exam:

MechanismWhat it does
COPY INTOIdempotently loads files into a Delta table, skipping files already ingested
Auto LoaderStreaming incremental file ingestion (see 2.1)
External locations / volumesGovern access to cloud storage paths via Unity Catalog
Lakehouse FederationQuery foreign databases in place via a foreign catalog

COPY INTO is a retriable SQL command for bounded, incremental batch loads — it remembers which files it loaded, so re-running it does not duplicate data; choose Auto Loader for very high file volumes and COPY INTO for simpler scheduled loads. Cloud storage is reached through Unity Catalog external locations (registered storage credentials + paths) and volumes, replacing legacy DBFS mounts. Lakehouse Federation lets you register an external database as a foreign catalog and query its tables directly with Spark SQL — no data movement, governed by Unity Catalog.

Together these cover file-based, table-based, and live-query extraction.

COPY INTO vs Auto Loader: Choosing an Ingestion Path

Both COPY INTO and Auto Loader load files idempotently into Delta, so the exam tests when to pick each:

FactorCOPY INTOAuto Loader
InterfaceSQL commandStructured Streaming (cloudFiles)
State trackingPer-table load historyCheckpoint + RocksDB
ScaleThousands of filesMillions of files
Schema evolutionLimitedFull (schemaEvolutionMode)
Best forPeriodic, simpler loadsContinuous / very high volume

A representative COPY INTO:

COPY INTO main.bronze.events
FROM '/landing/events'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema'='true')
COPY_OPTIONS ('mergeSchema'='true');

Running this repeatedly only loads new files, because COPY INTO tracks which files it has already ingested. For a low-volume daily drop, COPY INTO is the simplest choice; once file counts climb into the millions or you need true streaming with rich schema evolution, Auto Loader is the recommendation.

Secure Access: Secrets, External Locations, and Volumes

Never embed credentials in notebook code. Databricks secret scopes store sensitive values, retrieved at runtime with dbutils.secrets.get(scope, key); the value is redacted in any output, so it cannot leak into logs. JDBC passwords, API tokens, and storage keys should all come from secrets.

For cloud storage, Unity Catalog governs access through two abstractions:

  • Storage credentials — an IAM role or service principal Unity Catalog uses to reach cloud storage.
  • External locations — a named path plus a storage credential, on which you grant READ FILES / WRITE FILES.
  • Volumes — governed file storage (/Volumes/cat/schema/vol/) for unstructured and semi-structured data, replacing DBFS mounts.

This governed model means a data engineer queries /Volumes/... or an external-location path and Unity Catalog enforces who may read or write, auditing every access — far safer than legacy mount points with embedded keys. For relational sources, Lakehouse Federation likewise uses a connection object (holding credentials) plus a foreign catalog to expose external tables, so the same governance and secret-management discipline applies whether the source is files, a JDBC database, or a federated warehouse. The recurring exam theme: extraction should be incremental, idempotent, and credential-safe.

Pushdown, Predicate Filtering, and Source Selection

When extracting from JDBC or federated sources, pushdown is what keeps reads fast: rather than pulling an entire table and filtering in Spark, push the work to the source. With JDBC you achieve this by passing a query option containing a WHERE clause, or by relying on Spark/Catalyst to translate a .filter() into source-side SQL. Lakehouse Federation similarly pushes filters, projections, and many aggregations down to the remote engine, so only the needed rows cross the network. The general principle: read as little as possible by letting the source do the filtering.

Choosing the right extraction mechanism is a frequent exam decision. A quick decision guide:

ScenarioBest choice
Files continuously landing, huge volumeAuto Loader (cloudFiles)
Periodic file drops, modest volumeCOPY INTO
One-time pull from a relational DBJDBC read (parallelized)
Query a live external DB without copyingLakehouse Federation
Reach governed cloud filesUnity Catalog volumes / external locations

No matter the path, the engineering goals are consistent: ingest only new data (incrementality), tolerate retries without duplication (idempotency), keep credentials in secret scopes (security), and let the source filter where possible (pushdown). Internalizing this matrix lets you answer most 'which tool should I use to extract X?' questions on the certification quickly and correctly.

Test Your Knowledge

A JDBC read of a very large table runs slowly through a single task. Which set of options parallelizes the read?

A
B
C
D
Test Your Knowledge

Which feature lets you query an external PostgreSQL database directly from Databricks SQL without copying its data into the lakehouse?

A
B
C
D
Test Your Knowledge

How should a database password be supplied to a JDBC read in a Databricks notebook?

A
B
C
D