2.2 Reading Data with Spark SQL and PySpark
Key Takeaways
- spark.read returns a DataFrameReader; you set .format(), .option()/.options(), and either .schema() or inferSchema before calling .load() or a shortcut like .csv()/.json()/.parquet().
- CSV and JSON are schema-on-read text formats needing header/inferSchema options, while Parquet and Delta carry their own schema and are columnar and far faster.
- In Spark SQL you query files directly with the format.`/path` syntax, e.g. SELECT * FROM json.`/data/events`.
- Providing an explicit schema (StructType or DDL string) avoids an extra inference pass and prevents wrong type guesses.
- Reading is lazy: load() builds a plan and no data moves until an action executes.
The DataFrameReader
All batch reads in PySpark start with spark.read, which returns a DataFrameReader. You chain configuration methods and finish with an action-like load():
df = (spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/people.csv"))
Format-specific shortcuts exist — spark.read.csv(path), .json(path), .parquet(path), .load(path) for Delta with .format("delta"). The general methods are:
| Method | Role |
|---|---|
.format(name) | Source format (csv, json, parquet, delta, jdbc, text) |
.option(k, v) / .options(dict) | Per-format read options |
.schema(schema) | Explicit StructType or DDL-string schema |
.load(path) | Build the DataFrame plan |
Reads are lazy: load() only defines a plan. No files are scanned until an action such as display(), count(), collect(), or write runs.
Format Characteristics
Choosing a format matters for both correctness and speed:
| Format | Schema | Notes |
|---|---|---|
| CSV | Schema-on-read | Need header, inferSchema, sep, nullValue; slow, error-prone types |
| JSON | Schema-on-read | Supports nested structs/arrays; multiLine for pretty-printed files |
| Parquet | Self-describing | Columnar, compressed, predicate pushdown, fast |
| Delta | Self-describing | Parquet + transaction log; ACID, time travel, schema enforcement |
CSV and JSON are text formats that carry no reliable type metadata, so Spark must either infer types (an extra full pass over the data) or accept a provided schema. Parquet stores the schema and statistics in its footer, enabling column pruning and predicate pushdown. Delta builds on Parquet by adding a transaction log (_delta_log) that delivers ACID guarantees, time travel, and schema enforcement — it is the default and preferred table format on Databricks.
Providing a Schema and Reading via SQL
Letting Spark infer a schema for CSV/JSON triggers an extra scan and can guess types incorrectly (e.g., a ZIP code as an integer, losing leading zeros). For production reads, supply an explicit schema as a StructType or a concise DDL string:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
StructField("id", IntegerType()),
StructField("name", StringType())])
df = spark.read.schema(schema).json("/data/users")
# or: spark.read.schema("id INT, name STRING").json("/data/users")
In Spark SQL you can query files directly without first creating a table using the format.\path`` syntax:
SELECT * FROM json.`/data/events`;
SELECT count(*) FROM parquet.`/data/sales`;
This is convenient for exploration but offers no caching or statistics; for repeated access, register a table or view instead.
Read Modes, Common Options, and Lazy Plans
When reading semi-structured text, Spark must decide what to do with malformed records. The CSV and JSON readers accept a mode option:
| Mode | Behavior on bad records |
|---|---|
PERMISSIVE (default) | Set unparsable fields to null; corrupt rows go to _corrupt_record |
DROPMALFORMED | Discard rows that fail to parse |
FAILFAST | Throw an exception on the first bad record |
Other frequently used CSV/JSON options include sep (delimiter), nullValue, dateFormat, escape, quote, and multiLine (for JSON objects or CSV fields that span lines). For temporary views over files you can also use spark.read...createOrReplaceTempView("v") to query the data by name.
Because reads are lazy, you can layer transformations — select, filter, withColumn — and Spark's Catalyst optimizer fuses them into one physical plan, only scanning the columns and partitions actually needed when an action finally runs. This means putting a narrow select and a selective filter right after a Parquet/Delta read lets Spark prune columns and push the predicate down to the file footer, dramatically reducing I/O. With text formats no such statistics exist, so Spark must read whole files.
A practical rule for the exam: choose the most structured format available (Delta > Parquet > JSON > CSV), supply an explicit schema for text sources, and rely on laziness plus column pruning for performance rather than reading then discarding data in Python. Note too that DDL-string schemas ("id INT, name STRING") are far more concise than building a full StructType and are equally valid wherever a schema is accepted.
Transformations vs Actions and Inspecting a DataFrame
Understanding the transformation/action split is fundamental to how Spark reads and processes data. Transformations (select, filter, withColumn, join, groupBy, orderBy) are lazy — they build up the logical plan but move no data. Only an action triggers execution and reads files:
| Common actions | What they do |
|---|---|
count() | Return the number of rows |
collect() | Bring all rows to the driver (use cautiously) |
take(n) / show(n) | Return / display the first n rows |
write / save / toTable | Persist the result |
display(df) | Databricks-native rich rendering |
A further distinction inside transformations: narrow transformations (select, filter) need no data movement between partitions, while wide transformations (groupBy, join, distinct) trigger a shuffle across the cluster. Knowing which operations shuffle helps explain query cost.
To understand what data a read produced, use df.printSchema() to see the inferred or supplied structure, df.columns for column names, and df.dtypes for types. df.explain() prints the physical plan, revealing whether column pruning and predicate pushdown were applied to a Parquet or Delta scan. These inspection tools are how an engineer confirms that a read is efficient before scaling it to production volumes — exactly the diagnostic workflow the certification expects.
When reading a CSV with column names in the first row, which option must you set so Spark does not treat that row as data?
Why is Parquet generally preferred over CSV for analytics reads in Spark?
Which Spark SQL statement correctly queries JSON files directly from a path without creating a table?