2.9 Data Extraction from External Sources
Key Takeaways
- JDBC reads pull from relational databases via .format('jdbc') with url, dbtable (or query), user, and password; partitionColumn/lowerBound/upperBound/numPartitions parallelize the read.
- COPY INTO is an idempotent SQL command that incrementally loads files into a Delta table, skipping already-loaded files.
- Cloud storage is accessed through Unity Catalog external locations and volumes rather than legacy DBFS mounts.
- Lakehouse Federation queries external databases (PostgreSQL, MySQL, Snowflake, Redshift) in place via foreign catalogs without copying data.
- Secrets for source credentials should come from dbutils.secrets / secret scopes, never hard-coded in notebooks.
JDBC: Reading Relational Databases
To pull data from an external relational database (PostgreSQL, MySQL, SQL Server, Oracle), Spark uses the JDBC data source. You supply a connection URL, a table or query, and credentials:
df = (spark.read.format("jdbc")
.option("url", "jdbc:postgresql://host:5432/sales")
.option("dbtable", "public.orders")
.option("user", user)
.option("password", pw)
.load())
By default a JDBC read uses a single connection, so the whole table streams through one task — a bottleneck for large tables. To parallelize, provide partitionColumn, lowerBound, upperBound, and numPartitions; Spark then issues numPartitions range queries in parallel. Use a query option (a subquery) to push filtering and joins down to the source database. Credentials must come from dbutils.secrets / a secret scope — never hard-code passwords in a notebook.
COPY INTO, Cloud Storage, and Lakehouse Federation
Several other extraction paths appear on the exam:
| Mechanism | What it does |
|---|---|
| COPY INTO | Idempotently loads files into a Delta table, skipping files already ingested |
| Auto Loader | Streaming incremental file ingestion (see 2.1) |
| External locations / volumes | Govern access to cloud storage paths via Unity Catalog |
| Lakehouse Federation | Query foreign databases in place via a foreign catalog |
COPY INTO is a retriable SQL command for bounded, incremental batch loads — it remembers which files it loaded, so re-running it does not duplicate data; choose Auto Loader for very high file volumes and COPY INTO for simpler scheduled loads. Cloud storage is reached through Unity Catalog external locations (registered storage credentials + paths) and volumes, replacing legacy DBFS mounts. Lakehouse Federation lets you register an external database as a foreign catalog and query its tables directly with Spark SQL — no data movement, governed by Unity Catalog.
Together these cover file-based, table-based, and live-query extraction.
COPY INTO vs Auto Loader: Choosing an Ingestion Path
Both COPY INTO and Auto Loader load files idempotently into Delta, so the exam tests when to pick each:
| Factor | COPY INTO | Auto Loader |
|---|---|---|
| Interface | SQL command | Structured Streaming (cloudFiles) |
| State tracking | Per-table load history | Checkpoint + RocksDB |
| Scale | Thousands of files | Millions of files |
| Schema evolution | Limited | Full (schemaEvolutionMode) |
| Best for | Periodic, simpler loads | Continuous / very high volume |
A representative COPY INTO:
COPY INTO main.bronze.events
FROM '/landing/events'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema'='true')
COPY_OPTIONS ('mergeSchema'='true');
Running this repeatedly only loads new files, because COPY INTO tracks which files it has already ingested. For a low-volume daily drop, COPY INTO is the simplest choice; once file counts climb into the millions or you need true streaming with rich schema evolution, Auto Loader is the recommendation.
Secure Access: Secrets, External Locations, and Volumes
Never embed credentials in notebook code. Databricks secret scopes store sensitive values, retrieved at runtime with dbutils.secrets.get(scope, key); the value is redacted in any output, so it cannot leak into logs. JDBC passwords, API tokens, and storage keys should all come from secrets.
For cloud storage, Unity Catalog governs access through two abstractions:
- Storage credentials — an IAM role or service principal Unity Catalog uses to reach cloud storage.
- External locations — a named path plus a storage credential, on which you grant
READ FILES/WRITE FILES. - Volumes — governed file storage (
/Volumes/cat/schema/vol/) for unstructured and semi-structured data, replacing DBFS mounts.
This governed model means a data engineer queries /Volumes/... or an external-location path and Unity Catalog enforces who may read or write, auditing every access — far safer than legacy mount points with embedded keys. For relational sources, Lakehouse Federation likewise uses a connection object (holding credentials) plus a foreign catalog to expose external tables, so the same governance and secret-management discipline applies whether the source is files, a JDBC database, or a federated warehouse. The recurring exam theme: extraction should be incremental, idempotent, and credential-safe.
Pushdown, Predicate Filtering, and Source Selection
When extracting from JDBC or federated sources, pushdown is what keeps reads fast: rather than pulling an entire table and filtering in Spark, push the work to the source. With JDBC you achieve this by passing a query option containing a WHERE clause, or by relying on Spark/Catalyst to translate a .filter() into source-side SQL. Lakehouse Federation similarly pushes filters, projections, and many aggregations down to the remote engine, so only the needed rows cross the network. The general principle: read as little as possible by letting the source do the filtering.
Choosing the right extraction mechanism is a frequent exam decision. A quick decision guide:
| Scenario | Best choice |
|---|---|
| Files continuously landing, huge volume | Auto Loader (cloudFiles) |
| Periodic file drops, modest volume | COPY INTO |
| One-time pull from a relational DB | JDBC read (parallelized) |
| Query a live external DB without copying | Lakehouse Federation |
| Reach governed cloud files | Unity Catalog volumes / external locations |
No matter the path, the engineering goals are consistent: ingest only new data (incrementality), tolerate retries without duplication (idempotency), keep credentials in secret scopes (security), and let the source filter where possible (pushdown). Internalizing this matrix lets you answer most 'which tool should I use to extract X?' questions on the certification quickly and correctly.
A JDBC read of a very large table runs slowly through a single task. Which set of options parallelizes the read?
Which feature lets you query an external PostgreSQL database directly from Databricks SQL without copying its data into the lakehouse?
How should a database password be supplied to a JDBC read in a Databricks notebook?