2.2 Reading Data with Spark SQL and PySpark
Key Takeaways
- Spark SQL and PySpark both provide methods to read data from files, tables, and external sources into DataFrames.
- Common file formats include Parquet, Delta, JSON, CSV, Avro, and ORC — each with specific read options.
- CREATE TABLE and CREATE VIEW are the SQL-equivalent ways to define persistent and virtual data access objects.
- Temporary views exist only for the current Spark session, while global temporary views are accessible across notebooks in the same cluster.
- External tables point to data stored outside of the managed Delta location, allowing reads without data movement.
Reading Data with Spark SQL and PySpark
Quick Answer: Spark reads data from files using format-specific readers (spark.read.format()) or SQL commands (SELECT, CREATE TABLE). Data can be read from Parquet, Delta, JSON, CSV, and other formats. Tables persist metadata in the catalog, while views are virtual and computed on each query.
Reading Files with PySpark
Common File Reads
# Read Parquet files
df = spark.read.format("parquet").load("/data/files/sales.parquet")
# Read JSON files
df = spark.read.format("json").load("/data/files/events.json")
# Read CSV files with header and schema inference
df = (spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("sep", ",")
.load("/data/files/customers.csv")
)
# Read Delta table by path
df = spark.read.format("delta").load("/data/delta/orders/")
# Read Delta table by name (Unity Catalog)
df = spark.table("my_catalog.my_schema.orders")
Read Options for Common Formats
| Format | Common Options | Example |
|---|---|---|
| CSV | header, inferSchema, sep, quote, escape, multiLine | .option("header", "true") |
| JSON | multiLine, primitivesAsString | .option("multiLine", "true") |
| Parquet | mergeSchema | .option("mergeSchema", "true") |
| Delta | versionAsOf, timestampAsOf | .option("versionAsOf", "5") |
Reading Data with SQL
Direct File Reads in SQL
-- Read JSON files directly in SQL using backtick notation
SELECT * FROM json.`/data/files/events.json`;
-- Read CSV files
SELECT * FROM csv.`/data/files/customers.csv`;
-- Read Parquet files
SELECT * FROM parquet.`/data/files/sales.parquet`;
-- Read Delta files by path
SELECT * FROM delta.`/data/delta/orders/`;
Tables: Managed vs. External
Managed Tables
-- Managed table: Databricks manages both metadata AND data files
CREATE TABLE my_catalog.my_schema.sales (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
order_date DATE
);
-- Create table from query results
CREATE TABLE my_catalog.my_schema.sales_2026 AS
SELECT * FROM my_catalog.my_schema.sales
WHERE order_date >= '2026-01-01';
- Dropping a managed table deletes both the metadata AND the data files.
- Data is stored in the Unity Catalog managed storage location.
External Tables
-- External table: Databricks manages metadata only, data lives at LOCATION
CREATE TABLE my_catalog.my_schema.external_sales
LOCATION 's3://my-bucket/data/sales/'
AS SELECT * FROM raw_data;
- Dropping an external table deletes only the metadata. The data files remain.
- Data is stored at the specified LOCATION, not in managed storage.
| Aspect | Managed Table | External Table |
|---|---|---|
| Data location | Unity Catalog managed storage | User-specified LOCATION |
| DROP TABLE | Deletes metadata + data | Deletes metadata only |
| Data lifecycle | Tied to table lifecycle | Independent of table |
| Best for | Most use cases | Data shared with other systems |
Views
Regular Views
-- A view is a saved query — no data is stored
CREATE VIEW my_catalog.my_schema.high_value_orders AS
SELECT * FROM my_catalog.my_schema.orders
WHERE total_amount > 1000;
Temporary Views
-- Temporary view: exists only for the current Spark session
CREATE TEMP VIEW recent_orders AS
SELECT * FROM my_catalog.my_schema.orders
WHERE order_date >= current_date() - INTERVAL 7 DAYS;
Global Temporary Views
-- Global temp view: accessible across notebooks attached to the same cluster
CREATE GLOBAL TEMP VIEW cross_notebook_data AS
SELECT * FROM my_catalog.my_schema.orders;
-- Must reference with global_temp schema
SELECT * FROM global_temp.cross_notebook_data;
| View Type | Scope | Persists | Query |
|---|---|---|---|
| View | Catalog (permanent) | Yes — persists in Unity Catalog | SELECT * FROM catalog.schema.view_name |
| Temp View | Current Spark session | No — gone when session ends | SELECT * FROM view_name |
| Global Temp View | All sessions on the same cluster | No — gone when cluster restarts | SELECT * FROM global_temp.view_name |
On the Exam: Know the difference between managed tables (DROP deletes data), external tables (DROP keeps data), and views (no data stored). Also know that temp views are session-scoped while global temp views are cluster-scoped.
A data engineer drops a managed Delta table from the Unity Catalog. What happens to the underlying data files?
Which SQL syntax correctly reads a JSON file directly in a Spark SQL query?
How do you access a global temporary view named "shared_data" in a SQL query?