2.9 Data Extraction from External Sources

Key Takeaways

  • Databricks can read data from cloud storage (S3, ADLS, GCS), databases (JDBC), message queues (Kafka, Event Hubs), and REST APIs.
  • JDBC connections enable reading from relational databases like PostgreSQL, MySQL, SQL Server, and Oracle directly into Spark DataFrames.
  • Unity Catalog volumes provide governed storage for non-tabular files like CSVs, JSONs, images, and model artifacts.
  • The read_files() table-valued function in SQL provides a convenient way to read files from cloud storage and volumes.
  • Connection credentials should be stored in Databricks secrets or Unity Catalog service credentials, never hardcoded in notebooks.
Last updated: March 2026

Quick Answer: Databricks reads from cloud storage (Auto Loader, read_files), databases (JDBC), message queues (Kafka), and file systems (volumes). Store credentials in Databricks secrets, not in code.

Cloud Storage Reads

read_files() SQL Function

-- Read JSON files from a volume
-- (the header option applies only to CSV, so it is omitted here)
SELECT * FROM read_files(
    '/Volumes/my_catalog/my_schema/raw_data/events/',
    format => 'json'
);

-- Read CSV with specific options
SELECT * FROM read_files(
    's3://my-bucket/data/customers.csv',
    format => 'csv',
    header => 'true',
    inferSchema => 'true',
    delimiter => '|'
);

Unity Catalog Volumes

-- Create a volume for file storage
CREATE VOLUME my_catalog.my_schema.raw_data;

-- List files in a volume
LIST '/Volumes/my_catalog/my_schema/raw_data/';

-- Read files from a volume
SELECT * FROM read_files('/Volumes/my_catalog/my_schema/raw_data/events/');
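
Because volumes are addressed with ordinary file paths, the same /Volumes/<catalog>/<schema>/<volume> layout works from Python too. A small helper (hypothetical, for illustration) that assembles such paths:

```python
def volume_path(catalog, schema, volume, *subdirs):
    """Build a /Volumes/... path from Unity Catalog object names."""
    parts = ["/Volumes", catalog, schema, volume, *subdirs]
    return "/".join(parts)

# On Databricks, a batch reader can consume the path directly, e.g.:
#   spark.read.format("json").load(
#       volume_path("my_catalog", "my_schema", "raw_data", "events"))
```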

JDBC Database Reads

# Read from a PostgreSQL database via JDBC
jdbc_url = "jdbc:postgresql://host:5432/database"
connection_properties = {
    "user": dbutils.secrets.get("scope", "pg-user"),
    "password": dbutils.secrets.get("scope", "pg-password"),
    "driver": "org.postgresql.Driver"
}

df = (spark.read
    .jdbc(jdbc_url, "public.customers", properties=connection_properties)
)

# Read with query pushdown
df = (spark.read
    .jdbc(
        url=jdbc_url,
        table="(SELECT * FROM customers WHERE active = true) AS t",
        properties=connection_properties
    )
)

# Read with partitioning for parallel extraction
# (the partition column must be numeric, date, or timestamp)
df = (spark.read
    .jdbc(
        url=jdbc_url,
        table="orders",
        column="order_id",
        lowerBound=1,
        upperBound=1000000,
        numPartitions=10,
        properties=connection_properties
    )
)
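
To see why the partitioned read above runs in parallel: Spark turns column, lowerBound, upperBound, and numPartitions into one WHERE clause per partition, and each task issues its own query. A simplified reimplementation of that splitting (an approximation for illustration, not Spark's exact internal code):

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower_bound + i * stride
        hi = lo + stride
        if i == 0:
            # First partition also sweeps up rows below lowerBound (and NULLs)
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition also sweeps up rows at or above upperBound
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates
```

Note that the bounds only shape how the table is split; no rows are filtered out, because the first and last predicates are open-ended.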

Kafka and Event Hubs

# Read from Apache Kafka
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Schema for the JSON payload in the message value (fields are illustrative)
schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", LongType())
])

kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events-topic")
    .option("startingOffsets", "latest")
    .load()
    .select(
        col("key").cast("string"),
        from_json(col("value").cast("string"), schema).alias("data"),
        col("timestamp")
    )
)
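
Azure Event Hubs exposes a Kafka-compatible endpoint on port 9093, so the same kafka source works against it; only the connection options change. A sketch of the option dict (the namespace and topic are placeholders, and the kafkashaded JAAS class name assumes the Databricks runtime's shaded Kafka client):

```python
def event_hubs_kafka_options(namespace, connection_string, topic):
    """Kafka-source options for reading an Event Hub over the Kafka protocol."""
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "kafka.security.protocol": "SASL_SSL",   # Event Hubs requires TLS + SASL
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": (
            "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
            f'required username="$ConnectionString" password="{connection_string}";'
        ),
        "subscribe": topic,                      # the event hub name acts as the topic
    }

# On Databricks the dict plugs straight into the reader:
#   spark.readStream.format("kafka").options(**event_hubs_kafka_options(...)).load()
```

The connection string is a credential and should itself come from a secret scope, not appear literally in the notebook.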

Credential Management

Method               Security                              Best For
Databricks Secrets   Encrypted, redacted in output         API keys, database passwords
Service Credentials  Unity Catalog managed                 Cloud storage access
External Vaults      Azure Key Vault, AWS Secrets Manager  Enterprise credential management
Hardcoded            NEVER                                 Nothing; always a security risk
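
Pulling the patterns together, a small helper (names are illustrative) that resolves JDBC connection properties from a secret scope. dbutils.secrets.get decrypts the value at call time, and Databricks redacts it if it is ever printed in notebook output:

```python
def jdbc_credentials(dbutils, scope="my_scope",
                     user_key="pg-user", password_key="pg-password"):
    """Resolve JDBC connection properties from a Databricks secret scope."""
    return {
        "user": dbutils.secrets.get(scope, user_key),        # never hardcode these
        "password": dbutils.secrets.get(scope, password_key),
        "driver": "org.postgresql.Driver",
    }
```

The returned dict can be passed directly as the properties argument of spark.read.jdbc, as in the PostgreSQL example above.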

On the Exam: Know that Auto Loader is preferred for cloud storage ingestion, JDBC for database reads, and credentials must be managed via secrets. Understand that read_files() provides SQL access to file data.

Test Your Knowledge

A data engineer needs to extract data from a PostgreSQL database into Databricks for processing. Which approach should they use?

Test Your Knowledge

Where should database connection passwords be stored when used in Databricks notebooks?
