2.2 Reading Data with Spark SQL and PySpark

Key Takeaways

  • Spark SQL and PySpark both provide methods to read data from files, tables, and external sources into DataFrames.
  • Common file formats include Parquet, Delta, JSON, CSV, Avro, and ORC — each with specific read options.
  • CREATE TABLE and CREATE VIEW are the SQL-equivalent ways to define persistent and virtual data access objects.
  • Temporary views exist only for the current Spark session, while global temporary views are accessible across notebooks in the same cluster.
  • External tables point to data stored outside of the managed Delta location, allowing reads without data movement.
Last updated: March 2026

Reading Data with Spark SQL and PySpark

Quick Answer: Spark reads data from files using format-specific readers (spark.read.format()) or SQL commands (SELECT, CREATE TABLE). Data can be read from Parquet, Delta, JSON, CSV, and other formats. Tables persist metadata in the catalog, while views are virtual and computed on each query.

Reading Files with PySpark

Common File Reads

# Read Parquet files
df = spark.read.format("parquet").load("/data/files/sales.parquet")

# Read JSON files
df = spark.read.format("json").load("/data/files/events.json")

# Read CSV files with header and schema inference
df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("/data/files/customers.csv")
)

# Read Delta table by path
df = spark.read.format("delta").load("/data/delta/orders/")

# Read Delta table by name (Unity Catalog)
df = spark.table("my_catalog.my_schema.orders")

Read Options for Common Formats

FormatCommon OptionsExample
CSVheader, inferSchema, sep, quote, escape, multiLine.option("header", "true")
JSONmultiLine, primitivesAsString.option("multiLine", "true")
ParquetmergeSchema.option("mergeSchema", "true")
DeltaversionAsOf, timestampAsOf.option("versionAsOf", "5")

Reading Data with SQL

Direct File Reads in SQL

-- Read JSON files directly in SQL using backtick notation
SELECT * FROM json.`/data/files/events.json`;

-- Read CSV files
SELECT * FROM csv.`/data/files/customers.csv`;

-- Read Parquet files
SELECT * FROM parquet.`/data/files/sales.parquet`;

-- Read Delta files by path
SELECT * FROM delta.`/data/delta/orders/`;

Tables: Managed vs. External

Managed Tables

-- Managed table: Databricks manages both metadata AND data files
CREATE TABLE my_catalog.my_schema.sales (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    order_date DATE
);

-- Create table from query results
CREATE TABLE my_catalog.my_schema.sales_2026 AS
SELECT * FROM my_catalog.my_schema.sales
WHERE order_date >= '2026-01-01';
  • Dropping a managed table deletes both the metadata AND the data files.
  • Data is stored in the Unity Catalog managed storage location.

External Tables

-- External table: Databricks manages metadata only, data lives at LOCATION
CREATE TABLE my_catalog.my_schema.external_sales
LOCATION 's3://my-bucket/data/sales/'
AS SELECT * FROM raw_data;
  • Dropping an external table deletes only the metadata. The data files remain.
  • Data is stored at the specified LOCATION, not in managed storage.
AspectManaged TableExternal Table
Data locationUnity Catalog managed storageUser-specified LOCATION
DROP TABLEDeletes metadata + dataDeletes metadata only
Data lifecycleTied to table lifecycleIndependent of table
Best forMost use casesData shared with other systems

Views

Regular Views

-- A view is a saved query — no data is stored
CREATE VIEW my_catalog.my_schema.high_value_orders AS
SELECT * FROM my_catalog.my_schema.orders
WHERE total_amount > 1000;

Temporary Views

-- Temporary view: exists only for the current Spark session
CREATE TEMP VIEW recent_orders AS
SELECT * FROM my_catalog.my_schema.orders
WHERE order_date >= current_date() - INTERVAL 7 DAYS;

Global Temporary Views

-- Global temp view: accessible across notebooks attached to the same cluster
CREATE GLOBAL TEMP VIEW cross_notebook_data AS
SELECT * FROM my_catalog.my_schema.orders;

-- Must reference with global_temp schema
SELECT * FROM global_temp.cross_notebook_data;
View TypeScopePersistsQuery
ViewCatalog (permanent)Yes — persists in Unity CatalogSELECT * FROM catalog.schema.view_name
Temp ViewCurrent Spark sessionNo — gone when session endsSELECT * FROM view_name
Global Temp ViewAll sessions on the same clusterNo — gone when cluster restartsSELECT * FROM global_temp.view_name

On the Exam: Know the difference between managed tables (DROP deletes data), external tables (DROP keeps data), and views (no data stored). Also know that temp views are session-scoped while global temp views are cluster-scoped.

Test Your Knowledge

A data engineer drops a managed Delta table from the Unity Catalog. What happens to the underlying data files?

A
B
C
D
Test Your Knowledge

Which SQL syntax correctly reads a JSON file directly in a Spark SQL query?

A
B
C
D
Test Your Knowledge

How do you access a global temporary view named "shared_data" in a SQL query?

A
B
C
D