2.9 Data Extraction from External Sources
Key Takeaways
- Databricks can read data from cloud storage (S3, ADLS, GCS), databases (JDBC), message queues (Kafka, Event Hubs), and REST APIs.
- JDBC connections enable reading from relational databases like PostgreSQL, MySQL, SQL Server, and Oracle directly into Spark DataFrames.
- Unity Catalog volumes provide governed storage for non-tabular files such as CSV and JSON files, images, and model artifacts.
- The read_files() table-valued function in SQL provides a convenient way to read files from cloud storage and volumes.
- Connection credentials should be stored in Databricks secrets or Unity Catalog service credentials, never hardcoded in notebooks.
Last updated: March 2026
Quick Answer: Databricks reads from cloud storage (Auto Loader, read_files), databases (JDBC), message queues (Kafka), and file systems (volumes). Store credentials in Databricks secrets, not in code.
Cloud Storage Reads
read_files() SQL Function
-- Read JSON files from a volume (the header option applies to CSV, not JSON)
SELECT * FROM read_files(
  '/Volumes/my_catalog/my_schema/raw_data/events/',
  format => 'json'
);
-- Read CSV with specific options
SELECT * FROM read_files(
  's3://my-bucket/data/customers.csv',
  format => 'csv',
  header => 'true',
  inferSchema => 'true',
  delimiter => '|'
);
Unity Catalog Volumes
-- Create a volume for file storage
CREATE VOLUME my_catalog.my_schema.raw_data;
-- List files in a volume
LIST '/Volumes/my_catalog/my_schema/raw_data/';
-- Read files from a volume
SELECT * FROM read_files('/Volumes/my_catalog/my_schema/raw_data/events/');
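For ongoing ingestion from cloud storage, Auto Loader (the `cloudFiles` stream source) is preferred over repeated batch reads because it tracks which files it has already processed. A minimal sketch, assuming a Databricks runtime; the volume paths and table name are placeholders:

```python
# Sketch: incremental ingestion with Auto Loader (cloudFiles).
# Assumes a Databricks runtime; paths and table name below are placeholders.

def ingest_events(spark, source_path, checkpoint_path, target_table):
    """Incrementally load new JSON files from cloud storage into a Delta table."""
    return (
        spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # format of the landing files
        .option("cloudFiles.schemaLocation", checkpoint_path)  # schema tracking/evolution
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)         # exactly-once progress tracking
        .trigger(availableNow=True)                            # drain the backlog, then stop
        .toTable(target_table)
    )

# Example call (on Databricks):
# ingest_events(spark,
#               "/Volumes/my_catalog/my_schema/raw_data/events/",
#               "/Volumes/my_catalog/my_schema/raw_data/_checkpoints/events",
#               "my_catalog.my_schema.bronze_events")
```

Wrapping the stream in a function keeps the paths injectable; `trigger(availableNow=True)` processes everything currently available and then stops, which suits scheduled jobs.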
JDBC Database Reads
# Read from a PostgreSQL database via JDBC
jdbc_url = "jdbc:postgresql://host:5432/database"
connection_properties = {
    "user": dbutils.secrets.get("scope", "pg-user"),
    "password": dbutils.secrets.get("scope", "pg-password"),
    "driver": "org.postgresql.Driver",
}

df = spark.read.jdbc(jdbc_url, "public.customers", properties=connection_properties)

# Read with query pushdown: the subquery executes on the database side
df = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT * FROM customers WHERE active = true) AS t",
    properties=connection_properties,
)

# Read with partitioning for parallel extraction
df = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    column="order_id",        # numeric column to split on
    lowerBound=1,
    upperBound=1000000,
    numPartitions=10,         # ten concurrent queries against the database
    properties=connection_properties,
)
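The `column`/`lowerBound`/`upperBound`/`numPartitions` options control how Spark splits the read into parallel queries, each with its own range predicate. A plain-Python illustration of the idea (Spark's internal stride arithmetic differs in detail):

```python
# Illustration only: roughly how partitioning options become per-partition
# WHERE clauses. Each predicate backs one task issuing one SELECT.

def partition_predicates(column, lower, upper, num_partitions):
    if num_partitions <= 1:
        return ["1 = 1"]  # single full scan, no splitting
    stride = (upper - lower) // num_partitions
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    preds = []
    prev = None
    for b in bounds:
        if prev is None:
            preds.append(f"{column} < {b}")  # first partition is open below
        else:
            preds.append(f"{column} >= {prev} AND {column} < {b}")
        prev = b
    preds.append(f"{column} >= {prev}")      # last partition is open above
    return preds
```

Note that `lowerBound` and `upperBound` only shape the split points; rows outside the range still land in the first and last partitions, so nothing is filtered out.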
Kafka and Event Hubs
from pyspark.sql.functions import col, from_json

# DDL schema of the JSON payload in the Kafka value (adjust to your data)
schema = "user_id BIGINT, event_type STRING, ts TIMESTAMP"

# Read from Apache Kafka
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events-topic")
    .option("startingOffsets", "latest")
    .load()
    .select(
        col("key").cast("string"),
        from_json(col("value").cast("string"), schema).alias("data"),
        col("timestamp"),
    )
)
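Kafka delivers keys and values as raw bytes, which is why the value is cast to a string before `from_json` parses it into columns. The per-record transformation, illustrated in plain Python with a placeholder payload:

```python
# Illustration only: what cast("string") + from_json does to one Kafka record.
# The payload shape is a placeholder, not a real topic's schema.
import json

raw_value = b'{"user_id": 42, "event_type": "click"}'  # value bytes as delivered by Kafka

record = json.loads(raw_value.decode("utf-8"))          # decode bytes, then parse JSON
# record is now a structured object: {"user_id": 42, "event_type": "click"}
```

Records whose value does not match the declared schema come back as null fields rather than failing the stream, so malformed messages are worth checking for downstream.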
Credential Management
| Method | Security | Best For |
|---|---|---|
| Databricks Secrets | Encrypted, redacted in output | API keys, database passwords |
| Service Credentials | Unity Catalog managed | Cloud storage access |
| External Vaults | Azure Key Vault, AWS Secrets Manager | Enterprise credential management |
| Hardcoded | None | Never; credentials leak through version control, logs, and notebook exports |
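A sketch of wiring the table above into code: the secret lookup is injected as a callable, so the same helper takes `dbutils.secrets.get` on Databricks and a test double elsewhere. The scope and key names follow the JDBC example earlier in this section; yours will differ:

```python
# Sketch: build JDBC connection properties from a secrets lookup instead of
# hardcoding values. The getter is injected so the helper is testable off-cluster.

def jdbc_properties(get_secret, scope="scope"):
    return {
        "user": get_secret(scope, "pg-user"),
        "password": get_secret(scope, "pg-password"),
        "driver": "org.postgresql.Driver",
    }

# On Databricks: props = jdbc_properties(dbutils.secrets.get)
# Off-cluster, a fake vault stands in for illustration:
fake_vault = {("scope", "pg-user"): "etl_user", ("scope", "pg-password"): "s3cret"}
props = jdbc_properties(lambda scope, key: fake_vault[(scope, key)])
```

Values fetched via `dbutils.secrets.get` are redacted in notebook output, so even a stray `print(props)` on Databricks would not expose the password.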
On the Exam: Know that Auto Loader is preferred for cloud storage ingestion, JDBC for database reads, and credentials must be managed via secrets. Understand that read_files() provides SQL access to file data.
Test Your Knowledge
A data engineer needs to extract data from a PostgreSQL database into Databricks for processing. Which approach should they use?
Test Your Knowledge
Where should database connection passwords be stored when used in Databricks notebooks?