2.4 Spark SQL Query Fundamentals

Key Takeaways

  • Logical evaluation order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY — WHERE filters rows before grouping, HAVING filters groups after aggregation.
  • Join types include INNER, LEFT/RIGHT/FULL OUTER, LEFT SEMI, LEFT ANTI, and CROSS; semi and anti joins filter the left side without duplicating columns.
  • Common table expressions (WITH ... AS) name a subquery to make multi-step queries readable and reusable within one statement.
  • Window functions (OVER with PARTITION BY / ORDER BY) compute ranking and running totals without collapsing rows, unlike GROUP BY.
  • ROW_NUMBER, RANK, and DENSE_RANK differ in how they handle ties when ranking partitioned, ordered data.
Last updated: June 2026

Query Structure and Evaluation Order

A Spark SQL SELECT is written in one order but logically evaluated in another. Understanding this order explains why some clauses can reference aliases and others cannot:

StepClauseRole
1FROM / JOINChoose and combine sources
2WHEREFilter individual rows (pre-aggregation)
3GROUP BYCollapse rows into groups
4HAVINGFilter groups (post-aggregation)
5SELECTCompute output columns
6ORDER BYSort the result

The key tested distinction: WHERE filters rows before grouping, so it cannot reference aggregate functions; HAVING filters after aggregation, so HAVING count(*) > 5 is valid but WHERE count(*) > 5 is an error. Aggregate functions (count, sum, avg, min, max) combine many rows into one value per group.

Joins

Spark supports the full set of SQL joins. Beyond the familiar inner/outer joins, semi and anti joins are frequently tested because they filter the left table without adding the right table's columns:

JoinResult
INNER JOINOnly matching rows from both
LEFT OUTER JOINAll left rows; nulls where right has no match
RIGHT / FULL OUTERSymmetric / all rows from both
LEFT SEMI JOINLeft rows that have a match (no right columns)
LEFT ANTI JOINLeft rows that have no match
CROSS JOINCartesian product
SELECT c.* FROM customers c
LEFT ANTI JOIN orders o ON c.id = o.cust_id;  -- customers with zero orders

A LEFT SEMI JOIN is logically an EXISTS filter; a LEFT ANTI JOIN is a NOT EXISTS filter. Neither duplicates left rows even if multiple right matches exist, which is why they are preferred over inner joins for pure filtering.

CTEs and Window Functions

A common table expression (CTE) names a subquery with WITH, improving readability for multi-step logic:

WITH ranked AS (
  SELECT region, sales,
         ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS rn
  FROM monthly)
SELECT * FROM ranked WHERE rn = 1;  -- top seller per region

Window functions compute a value across a set of rows related to the current row without collapsing them — unlike GROUP BY, the row count is preserved. The OVER clause defines the window with PARTITION BY (groups) and ORDER BY (sequence). Ranking functions differ on ties:

FunctionBehavior on ties
ROW_NUMBER()Unique 1,2,3 even for ties (arbitrary)
RANK()Ties share a rank, then a gap (1,1,3)
DENSE_RANK()Ties share a rank, no gap (1,1,2)

Other window functions include LAG/LEAD (previous/next row) and running aggregates like SUM(x) OVER (ORDER BY d).

Aggregation, Subqueries, and Set Operations

Aggregate queries collapse many rows into summary values. The standard aggregates are count, count(DISTINCT col), sum, avg, min, max, plus collect_list/collect_set to gather values into an array. GROUP BY defines the grouping keys, and GROUP BY with ROLLUP or CUBE produces subtotals and grand totals — useful for reporting cubes.

Spark SQL also supports subqueries in several positions: scalar subqueries (returning one value) in SELECT, correlated subqueries with EXISTS/IN in WHERE, and derived tables in FROM. As shown with semi/anti joins, an EXISTS correlated subquery is often rewritten by Catalyst into a semi join.

Set operations combine two result sets that share a schema:

OperatorResult
UNIONRows from both, duplicates removed
UNION ALLRows from both, duplicates kept (faster)
INTERSECTRows in both
EXCEPT / MINUSRows in the first not in the second

A frequent exam trap: UNION deduplicates and therefore incurs a shuffle, while UNION ALL simply concatenates and is cheaper — choose UNION ALL unless you specifically need distinct rows. Finally, CASE WHEN ... THEN ... ELSE ... END provides conditional logic inside SELECT, and COALESCE(a, b) returns the first non-null argument, both heavily used when shaping query output. Combining CTEs, window functions, and aggregates lets you express sophisticated analytics in a single readable Spark SQL statement that Catalyst compiles into an optimized physical plan.

Two more tested mechanics: ORDER BY sorts the final result and can reference SELECT aliases (since it runs last), whereas WHERE and GROUP BY cannot reference SELECT aliases because they run earlier in the logical order. LIMIT n caps the returned rows and, on Databricks, QUALIFY filters directly on a window-function result (QUALIFY ROW_NUMBER() OVER (...) = 1) without needing an outer subquery — a concise alternative to the CTE-plus-filter pattern for top-N-per-group queries.

Null handling also trips up many candidates. In SQL, NULL is unknown, so col = NULL is never true — you must write col IS NULL / IS NOT NULL. Aggregates like count(col) ignore nulls while count(*) counts every row, and avg/sum skip nulls entirely. Use coalesce(a, b, c) to substitute the first non-null value and nullif(a, b) to turn a sentinel into null. When joining, rows with null keys never match in an inner join (null is not equal to null), which is a frequent cause of unexpectedly missing rows.

Mastering these null semantics, the logical evaluation order, the join taxonomy, and window functions covers the bulk of the Spark SQL questions on the exam.

Test Your Knowledge

Which clause filters groups AFTER aggregation, allowing a predicate like count(*) > 10?

A
B
C
D
Test Your Knowledge

You need every customer who has placed ZERO orders, returning only customer columns. Which join is most appropriate?

A
B
C
D
Test Your Knowledge

Three sales tie for the same amount. Which ranking function assigns them all rank 1 and then continues with rank 2 (no gap)?

A
B
C
D