2.9 Data Extraction from External Sources
Key Takeaways
- Databricks can read data from cloud storage (S3, ADLS, GCS), databases (JDBC), message queues (Kafka, Event Hubs), and REST APIs.
- JDBC connections enable reading from relational databases like PostgreSQL, MySQL, SQL Server, and Oracle directly into Spark DataFrames.
- Unity Catalog volumes provide governed storage for non-tabular files such as CSV and JSON files, images, and model artifacts.
- The read_files() table-valued function in SQL provides a convenient way to read files from cloud storage and volumes.
- Connection credentials should be stored in Databricks secrets or Unity Catalog service credentials, never hardcoded in notebooks.
Last updated: March 2026
Quick Answer: Databricks reads from cloud storage (Auto Loader, read_files), databases (JDBC), message queues (Kafka), and file systems (volumes). Store credentials in Databricks secrets, not in code.
Cloud Storage Reads
read_files() SQL Function
-- Read JSON files from a volume (the header option applies to CSV, not JSON)
SELECT * FROM read_files(
  '/Volumes/my_catalog/my_schema/raw_data/events/',
  format => 'json'
);
-- Read CSV with specific options
SELECT * FROM read_files(
  's3://my-bucket/data/customers.csv',
  format => 'csv',
  header => 'true',
  inferSchema => 'true',
  delimiter => '|'
);
Unity Catalog Volumes
-- Create a volume for file storage
CREATE VOLUME my_catalog.my_schema.raw_data;
-- List files in a volume
LIST '/Volumes/my_catalog/my_schema/raw_data/';
-- Read files from a volume
SELECT * FROM read_files('/Volumes/my_catalog/my_schema/raw_data/events/');
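For ongoing ingestion from cloud storage, Auto Loader (the `cloudFiles` stream source) is preferred over repeated batch reads because it tracks which files it has already processed. A minimal sketch, assuming a Databricks runtime; the volume paths and table name are placeholders:

```python
# Sketch: incremental ingestion with Auto Loader (cloudFiles).
# Assumes a Databricks runtime; paths and table name below are placeholders.

def ingest_events(spark, source_path, checkpoint_path, target_table):
    """Incrementally load new JSON files from cloud storage into a Delta table."""
    return (
        spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # format of the landing files
        .option("cloudFiles.schemaLocation", checkpoint_path)  # schema tracking/evolution
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)         # exactly-once progress tracking
        .trigger(availableNow=True)                            # drain the backlog, then stop
        .toTable(target_table)
    )

# Example call (on Databricks):
# ingest_events(spark,
#               "/Volumes/my_catalog/my_schema/raw_data/events/",
#               "/Volumes/my_catalog/my_schema/raw_data/_checkpoints/events",
#               "my_catalog.my_schema.bronze_events")
```

Wrapping the stream in a function keeps the paths injectable; `trigger(availableNow=True)` processes everything currently available and then stops, which suits scheduled jobs.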
JDBC Database Reads
# Read from a PostgreSQL database via JDBC
jdbc_url = "jdbc:postgresql://host:5432/database"
connection_properties = {
    "user": dbutils.secrets.get("scope", "pg-user"),
    "password": dbutils.secrets.get("scope", "pg-password"),
    "driver": "org.postgresql.Driver",
}

df = spark.read.jdbc(jdbc_url, "public.customers", properties=connection_properties)

# Read with query pushdown: the subquery executes on the database side
df = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT * FROM customers WHERE active = true) AS t",
    properties=connection_properties,
)

# Read with partitioning for parallel extraction
df = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    column="order_id",        # numeric column to split on
    lowerBound=1,
    upperBound=1000000,
    numPartitions=10,         # ten concurrent queries against the database
    properties=connection_properties,
)
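The `column`/`lowerBound`/`upperBound`/`numPartitions` options control how Spark splits the read into parallel queries, each with its own range predicate. A plain-Python illustration of the idea (Spark's internal stride arithmetic differs in detail):

```python
# Illustration only: roughly how partitioning options become per-partition
# WHERE clauses. Each predicate backs one task issuing one SELECT.

def partition_predicates(column, lower, upper, num_partitions):
    if num_partitions <= 1:
        return ["1 = 1"]  # single full scan, no splitting
    stride = (upper - lower) // num_partitions
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    preds = []
    prev = None
    for b in bounds:
        if prev is None:
            preds.append(f"{column} < {b}")  # first partition is open below
        else:
            preds.append(f"{column} >= {prev} AND {column} < {b}")
        prev = b
    preds.append(f"{column} >= {prev}")      # last partition is open above
    return preds
```

Note that `lowerBound` and `upperBound` only shape the split points; rows outside the range still land in the first and last partitions, so nothing is filtered out.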
Kafka and Event Hubs
from pyspark.sql.functions import col, from_json

# DDL schema of the JSON payload in the Kafka value (adjust to your data)
schema = "user_id BIGINT, event_type STRING, ts TIMESTAMP"

# Read from Apache Kafka
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events-topic")
    .option("startingOffsets", "latest")
    .load()
    .select(
        col("key").cast("string"),
        from_json(col("value").cast("string"), schema).alias("data"),
        col("timestamp"),
    )
)
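Kafka delivers keys and values as raw bytes, which is why the value is cast to a string before `from_json` parses it into columns. The per-record transformation, illustrated in plain Python with a placeholder payload:

```python
# Illustration only: what cast("string") + from_json does to one Kafka record.
# The payload shape is a placeholder, not a real topic's schema.
import json

raw_value = b'{"user_id": 42, "event_type": "click"}'  # value bytes as delivered by Kafka

record = json.loads(raw_value.decode("utf-8"))          # decode bytes, then parse JSON
# record is now a structured object: {"user_id": 42, "event_type": "click"}
```

Records whose value does not match the declared schema come back as null fields rather than failing the stream, so malformed messages are worth checking for downstream.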
Credential Management
| Method | Security | Best For |
|---|---|---|
| Databricks Secrets | Encrypted, redacted in output | API keys, database passwords |
| Service Credentials | Unity Catalog managed | Cloud storage access |
| External Vaults | Azure Key Vault, AWS Secrets Manager | Enterprise credential management |
| Hardcoded | None | Never; credentials leak through version control, logs, and notebook exports |
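A sketch of wiring the table above into code: the secret lookup is injected as a callable, so the same helper takes `dbutils.secrets.get` on Databricks and a test double elsewhere. The scope and key names follow the JDBC example earlier in this section; yours will differ:

```python
# Sketch: build JDBC connection properties from a secrets lookup instead of
# hardcoding values. The getter is injected so the helper is testable off-cluster.

def jdbc_properties(get_secret, scope="scope"):
    return {
        "user": get_secret(scope, "pg-user"),
        "password": get_secret(scope, "pg-password"),
        "driver": "org.postgresql.Driver",
    }

# On Databricks: props = jdbc_properties(dbutils.secrets.get)
# Off-cluster, a fake vault stands in for illustration:
fake_vault = {("scope", "pg-user"): "etl_user", ("scope", "pg-password"): "s3cret"}
props = jdbc_properties(lambda scope, key: fake_vault[(scope, key)])
```

Values fetched via `dbutils.secrets.get` are redacted in notebook output, so even a stray `print(props)` on Databricks would not expose the password.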
On the Exam: Know that Auto Loader is preferred for cloud storage ingestion, JDBC for database reads, and credentials must be managed via secrets. Understand that read_files() provides SQL access to file data.
Test Your Knowledge
A data engineer needs to extract data from a PostgreSQL database into Databricks for processing. Which approach should they use?
Test Your Knowledge
Where should database connection passwords be stored when used in Databricks notebooks?