2.2 Reading Data with Spark SQL and PySpark

Key Takeaways

  • spark.read returns a DataFrameReader; you set .format(), .option()/.options(), and either .schema() or inferSchema before calling .load() or a shortcut like .csv()/.json()/.parquet().
  • CSV and JSON are schema-on-read text formats needing header/inferSchema options, while Parquet and Delta carry their own schema and are columnar and far faster.
  • In Spark SQL you query files directly with the format.`/path` syntax, e.g. SELECT * FROM json.`/data/events`.
  • Providing an explicit schema (StructType or DDL string) avoids an extra inference pass and prevents wrong type guesses.
  • Reading is lazy: load() builds a plan and no data moves until an action executes.
Last updated: June 2026

The DataFrameReader

All batch reads in PySpark start with spark.read, which returns a DataFrameReader. You chain configuration methods and finish with an action-like load():

df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/people.csv"))

Format-specific shortcuts exist — spark.read.csv(path), .json(path), .parquet(path), .load(path) for Delta with .format("delta"). The general methods are:

MethodRole
.format(name)Source format (csv, json, parquet, delta, jdbc, text)
.option(k, v) / .options(dict)Per-format read options
.schema(schema)Explicit StructType or DDL-string schema
.load(path)Build the DataFrame plan

Reads are lazy: load() only defines a plan. No files are scanned until an action such as display(), count(), collect(), or write runs.

Format Characteristics

Choosing a format matters for both correctness and speed:

FormatSchemaNotes
CSVSchema-on-readNeed header, inferSchema, sep, nullValue; slow, error-prone types
JSONSchema-on-readSupports nested structs/arrays; multiLine for pretty-printed files
ParquetSelf-describingColumnar, compressed, predicate pushdown, fast
DeltaSelf-describingParquet + transaction log; ACID, time travel, schema enforcement

CSV and JSON are text formats that carry no reliable type metadata, so Spark must either infer types (an extra full pass over the data) or accept a provided schema. Parquet stores the schema and statistics in its footer, enabling column pruning and predicate pushdown. Delta builds on Parquet by adding a transaction log (_delta_log) that delivers ACID guarantees, time travel, and schema enforcement — it is the default and preferred table format on Databricks.

Providing a Schema and Reading via SQL

Letting Spark infer a schema for CSV/JSON triggers an extra scan and can guess types incorrectly (e.g., a ZIP code as an integer, losing leading zeros). For production reads, supply an explicit schema as a StructType or a concise DDL string:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
  StructField("id", IntegerType()),
  StructField("name", StringType())])
df = spark.read.schema(schema).json("/data/users")
# or: spark.read.schema("id INT, name STRING").json("/data/users")

In Spark SQL you can query files directly without first creating a table using the format.\path`` syntax:

SELECT * FROM json.`/data/events`;
SELECT count(*) FROM parquet.`/data/sales`;

This is convenient for exploration but offers no caching or statistics; for repeated access, register a table or view instead.

Read Modes, Common Options, and Lazy Plans

When reading semi-structured text, Spark must decide what to do with malformed records. The CSV and JSON readers accept a mode option:

ModeBehavior on bad records
PERMISSIVE (default)Set unparsable fields to null; corrupt rows go to _corrupt_record
DROPMALFORMEDDiscard rows that fail to parse
FAILFASTThrow an exception on the first bad record

Other frequently used CSV/JSON options include sep (delimiter), nullValue, dateFormat, escape, quote, and multiLine (for JSON objects or CSV fields that span lines). For temporary views over files you can also use spark.read...createOrReplaceTempView("v") to query the data by name.

Because reads are lazy, you can layer transformations — select, filter, withColumn — and Spark's Catalyst optimizer fuses them into one physical plan, only scanning the columns and partitions actually needed when an action finally runs. This means putting a narrow select and a selective filter right after a Parquet/Delta read lets Spark prune columns and push the predicate down to the file footer, dramatically reducing I/O. With text formats no such statistics exist, so Spark must read whole files.

A practical rule for the exam: choose the most structured format available (Delta > Parquet > JSON > CSV), supply an explicit schema for text sources, and rely on laziness plus column pruning for performance rather than reading then discarding data in Python. Note too that DDL-string schemas ("id INT, name STRING") are far more concise than building a full StructType and are equally valid wherever a schema is accepted.

Transformations vs Actions and Inspecting a DataFrame

Understanding the transformation/action split is fundamental to how Spark reads and processes data. Transformations (select, filter, withColumn, join, groupBy, orderBy) are lazy — they build up the logical plan but move no data. Only an action triggers execution and reads files:

Common actionsWhat they do
count()Return the number of rows
collect()Bring all rows to the driver (use cautiously)
take(n) / show(n)Return / display the first n rows
write / save / toTablePersist the result
display(df)Databricks-native rich rendering

A further distinction inside transformations: narrow transformations (select, filter) need no data movement between partitions, while wide transformations (groupBy, join, distinct) trigger a shuffle across the cluster. Knowing which operations shuffle helps explain query cost.

To understand what data a read produced, use df.printSchema() to see the inferred or supplied structure, df.columns for column names, and df.dtypes for types. df.explain() prints the physical plan, revealing whether column pruning and predicate pushdown were applied to a Parquet or Delta scan. These inspection tools are how an engineer confirms that a read is efficient before scaling it to production volumes — exactly the diagnostic workflow the certification expects.

Test Your Knowledge

When reading a CSV with column names in the first row, which option must you set so Spark does not treat that row as data?

A
B
C
D
Test Your Knowledge

Why is Parquet generally preferred over CSV for analytics reads in Spark?

A
B
C
D
Test Your Knowledge

Which Spark SQL statement correctly queries JSON files directly from a path without creating a table?

A
B
C
D