2.2 Reading Data with Spark SQL and PySpark

Key Takeaways

spark.read returns a DataFrameReader; you set .format(), .option()/.options(), and either .schema() or inferSchema before calling .load() or a shortcut like .csv()/.json()/.parquet().
CSV and JSON are schema-on-read text formats needing header/inferSchema options, while Parquet and Delta carry their own schema and are columnar and far faster.
In Spark SQL you query files directly with the format.`/path` syntax, e.g. SELECT * FROM json.`/data/events`.
Providing an explicit schema (StructType or DDL string) avoids an extra inference pass and prevents wrong type guesses.
Reading is lazy: load() builds a plan and no data moves until an action executes.

Last updated: June 2026

The DataFrameReader

All batch reads in PySpark start with spark.read, which returns a DataFrameReader. You chain configuration methods and finish with an action-like load():

df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/people.csv"))

Format-specific shortcuts exist — spark.read.csv(path), .json(path), .parquet(path), .load(path) for Delta with .format("delta"). The general methods are:

Method	Role
`.format(name)`	Source format (csv, json, parquet, delta, jdbc, text)
`.option(k, v)` / `.options(dict)`	Per-format read options
`.schema(schema)`	Explicit StructType or DDL-string schema
`.load(path)`	Build the DataFrame plan

Reads are lazy: load() only defines a plan. No files are scanned until an action such as display(), count(), collect(), or write runs.

Format Characteristics

Choosing a format matters for both correctness and speed:

Format	Schema	Notes
CSV	Schema-on-read	Need `header`, `inferSchema`, `sep`, `nullValue`; slow, error-prone types
JSON	Schema-on-read	Supports nested structs/arrays; `multiLine` for pretty-printed files
Parquet	Self-describing	Columnar, compressed, predicate pushdown, fast
Delta	Self-describing	Parquet + transaction log; ACID, time travel, schema enforcement

CSV and JSON are text formats that carry no reliable type metadata, so Spark must either infer types (an extra full pass over the data) or accept a provided schema. Parquet stores the schema and statistics in its footer, enabling column pruning and predicate pushdown. Delta builds on Parquet by adding a transaction log (_delta_log) that delivers ACID guarantees, time travel, and schema enforcement — it is the default and preferred table format on Databricks.

Providing a Schema and Reading via SQL

Letting Spark infer a schema for CSV/JSON triggers an extra scan and can guess types incorrectly (e.g., a ZIP code as an integer, losing leading zeros). For production reads, supply an explicit schema as a StructType or a concise DDL string:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
  StructField("id", IntegerType()),
  StructField("name", StringType())])
df = spark.read.schema(schema).json("/data/users")
# or: spark.read.schema("id INT, name STRING").json("/data/users")

In Spark SQL you can query files directly without first creating a table using the format.\path`` syntax:

SELECT * FROM json.`/data/events`;
SELECT count(*) FROM parquet.`/data/sales`;

This is convenient for exploration but offers no caching or statistics; for repeated access, register a table or view instead.

Read Modes, Common Options, and Lazy Plans

When reading semi-structured text, Spark must decide what to do with malformed records. The CSV and JSON readers accept a mode option:

Mode	Behavior on bad records
`PERMISSIVE` (default)	Set unparsable fields to null; corrupt rows go to `_corrupt_record`
`DROPMALFORMED`	Discard rows that fail to parse
`FAILFAST`	Throw an exception on the first bad record

Other frequently used CSV/JSON options include sep (delimiter), nullValue, dateFormat, escape, quote, and multiLine (for JSON objects or CSV fields that span lines). For temporary views over files you can also use spark.read...createOrReplaceTempView("v") to query the data by name.

Because reads are lazy, you can layer transformations — select, filter, withColumn — and Spark's Catalyst optimizer fuses them into one physical plan, only scanning the columns and partitions actually needed when an action finally runs. This means putting a narrow select and a selective filter right after a Parquet/Delta read lets Spark prune columns and push the predicate down to the file footer, dramatically reducing I/O. With text formats no such statistics exist, so Spark must read whole files.

A practical rule for the exam: choose the most structured format available (Delta > Parquet > JSON > CSV), supply an explicit schema for text sources, and rely on laziness plus column pruning for performance rather than reading then discarding data in Python. Note too that DDL-string schemas ("id INT, name STRING") are far more concise than building a full StructType and are equally valid wherever a schema is accepted.

Transformations vs Actions and Inspecting a DataFrame

Understanding the transformation/action split is fundamental to how Spark reads and processes data. Transformations (select, filter, withColumn, join, groupBy, orderBy) are lazy — they build up the logical plan but move no data. Only an action triggers execution and reads files:

Common actions	What they do
`count()`	Return the number of rows
`collect()`	Bring all rows to the driver (use cautiously)
`take(n)` / `show(n)`	Return / display the first n rows
`write` / `save` / `toTable`	Persist the result
`display(df)`	Databricks-native rich rendering

A further distinction inside transformations: narrow transformations (select, filter) need no data movement between partitions, while wide transformations (groupBy, join, distinct) trigger a shuffle across the cluster. Knowing which operations shuffle helps explain query cost.

To understand what data a read produced, use df.printSchema() to see the inferred or supplied structure, df.columns for column names, and df.dtypes for types. df.explain() prints the physical plan, revealing whether column pruning and predicate pushdown were applied to a Parquet or Delta scan. These inspection tools are how an engineer confirms that a read is efficient before scaling it to production volumes — exactly the diagnostic workflow the certification expects.

Test Your Knowledge

When reading a CSV with column names in the first row, which option must you set so Spark does not treat that row as data?

header = true

multiLine = true

mode = PERMISSIVE

inferSchema = true

Test Your Knowledge

Why is Parquet generally preferred over CSV for analytics reads in Spark?

Parquet files are always smaller because they are plain text

Parquet does not require a Spark session to read

CSV cannot store more than one column

Parquet is self-describing and columnar, enabling column pruning and predicate pushdown

Test Your Knowledge

Which Spark SQL statement correctly queries JSON files directly from a path without creating a table?

SELECT * FROM read('/data/events', 'json')

SELECT * FROM /data/events.json

SELECT * FROM json./data/events

SELECT * USING json LOCATION '/data/events'

Up Next

2.3 Writing Data to Delta Tables

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

2.2 Reading Data with Spark SQL and PySpark

Key Takeaways

The DataFrameReader

Format Characteristics

Providing a Schema and Reading via SQL

Read Modes, Common Options, and Lazy Plans

Transformations vs Actions and Inspecting a DataFrame

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

2.2 Reading Data with Spark SQL and PySpark

Key Takeaways

The DataFrameReader

Format Characteristics

Providing a Schema and Reading via SQL

Read Modes, Common Options, and Lazy Plans

Transformations vs Actions and Inspecting a DataFrame