2.4 Spark SQL Query Fundamentals
Key Takeaways
- Logical evaluation order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY — WHERE filters rows before grouping, HAVING filters groups after aggregation.
- Join types include INNER, LEFT/RIGHT/FULL OUTER, LEFT SEMI, LEFT ANTI, and CROSS; semi and anti joins filter the left side without duplicating columns.
- Common table expressions (WITH ... AS) name a subquery to make multi-step queries readable and reusable within one statement.
- Window functions (OVER with PARTITION BY / ORDER BY) compute ranking and running totals without collapsing rows, unlike GROUP BY.
- ROW_NUMBER, RANK, and DENSE_RANK differ in how they handle ties when ranking partitioned, ordered data.
Query Structure and Evaluation Order
A Spark SQL SELECT is written in one order but logically evaluated in another. Understanding this order explains why some clauses can reference aliases and others cannot:
| Step | Clause | Role |
|---|---|---|
| 1 | FROM / JOIN | Choose and combine sources |
| 2 | WHERE | Filter individual rows (pre-aggregation) |
| 3 | GROUP BY | Collapse rows into groups |
| 4 | HAVING | Filter groups (post-aggregation) |
| 5 | SELECT | Compute output columns |
| 6 | ORDER BY | Sort the result |
The key tested distinction: WHERE filters rows before grouping, so it cannot reference aggregate functions; HAVING filters after aggregation, so HAVING count(*) > 5 is valid but WHERE count(*) > 5 is an error. Aggregate functions (count, sum, avg, min, max) combine many rows into one value per group.
Joins
Spark supports the full set of SQL joins. Beyond the familiar inner/outer joins, semi and anti joins are frequently tested because they filter the left table without adding the right table's columns:
| Join | Result |
|---|---|
INNER JOIN | Only matching rows from both |
LEFT OUTER JOIN | All left rows; nulls where right has no match |
RIGHT / FULL OUTER | Symmetric / all rows from both |
LEFT SEMI JOIN | Left rows that have a match (no right columns) |
LEFT ANTI JOIN | Left rows that have no match |
CROSS JOIN | Cartesian product |
SELECT c.* FROM customers c
LEFT ANTI JOIN orders o ON c.id = o.cust_id; -- customers with zero orders
A LEFT SEMI JOIN is logically an EXISTS filter; a LEFT ANTI JOIN is a NOT EXISTS filter. Neither duplicates left rows even if multiple right matches exist, which is why they are preferred over inner joins for pure filtering.
CTEs and Window Functions
A common table expression (CTE) names a subquery with WITH, improving readability for multi-step logic:
WITH ranked AS (
SELECT region, sales,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS rn
FROM monthly)
SELECT * FROM ranked WHERE rn = 1; -- top seller per region
Window functions compute a value across a set of rows related to the current row without collapsing them — unlike GROUP BY, the row count is preserved. The OVER clause defines the window with PARTITION BY (groups) and ORDER BY (sequence). Ranking functions differ on ties:
| Function | Behavior on ties |
|---|---|
ROW_NUMBER() | Unique 1,2,3 even for ties (arbitrary) |
RANK() | Ties share a rank, then a gap (1,1,3) |
DENSE_RANK() | Ties share a rank, no gap (1,1,2) |
Other window functions include LAG/LEAD (previous/next row) and running aggregates like SUM(x) OVER (ORDER BY d).
Aggregation, Subqueries, and Set Operations
Aggregate queries collapse many rows into summary values. The standard aggregates are count, count(DISTINCT col), sum, avg, min, max, plus collect_list/collect_set to gather values into an array. GROUP BY defines the grouping keys, and GROUP BY with ROLLUP or CUBE produces subtotals and grand totals — useful for reporting cubes.
Spark SQL also supports subqueries in several positions: scalar subqueries (returning one value) in SELECT, correlated subqueries with EXISTS/IN in WHERE, and derived tables in FROM. As shown with semi/anti joins, an EXISTS correlated subquery is often rewritten by Catalyst into a semi join.
Set operations combine two result sets that share a schema:
| Operator | Result |
|---|---|
UNION | Rows from both, duplicates removed |
UNION ALL | Rows from both, duplicates kept (faster) |
INTERSECT | Rows in both |
EXCEPT / MINUS | Rows in the first not in the second |
A frequent exam trap: UNION deduplicates and therefore incurs a shuffle, while UNION ALL simply concatenates and is cheaper — choose UNION ALL unless you specifically need distinct rows. Finally, CASE WHEN ... THEN ... ELSE ... END provides conditional logic inside SELECT, and COALESCE(a, b) returns the first non-null argument, both heavily used when shaping query output. Combining CTEs, window functions, and aggregates lets you express sophisticated analytics in a single readable Spark SQL statement that Catalyst compiles into an optimized physical plan.
Two more tested mechanics: ORDER BY sorts the final result and can reference SELECT aliases (since it runs last), whereas WHERE and GROUP BY cannot reference SELECT aliases because they run earlier in the logical order. LIMIT n caps the returned rows and, on Databricks, QUALIFY filters directly on a window-function result (QUALIFY ROW_NUMBER() OVER (...) = 1) without needing an outer subquery — a concise alternative to the CTE-plus-filter pattern for top-N-per-group queries.
Null handling also trips up many candidates. In SQL, NULL is unknown, so col = NULL is never true — you must write col IS NULL / IS NOT NULL. Aggregates like count(col) ignore nulls while count(*) counts every row, and avg/sum skip nulls entirely. Use coalesce(a, b, c) to substitute the first non-null value and nullif(a, b) to turn a sentinel into null. When joining, rows with null keys never match in an inner join (null is not equal to null), which is a frequent cause of unexpectedly missing rows.
Mastering these null semantics, the logical evaluation order, the join taxonomy, and window functions covers the bulk of the Spark SQL questions on the exam.
Which clause filters groups AFTER aggregation, allowing a predicate like count(*) > 10?
You need every customer who has placed ZERO orders, returning only customer columns. Which join is most appropriate?
Three sales tie for the same amount. Which ranking function assigns them all rank 1 and then continues with rank 2 (no gap)?