2.4 Spark SQL Query Fundamentals
Key Takeaways
- Spark SQL supports standard SQL operations including SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and JOIN.
- Common table expressions (CTEs) with the WITH clause simplify complex queries by breaking them into named subqueries.
- Window functions like ROW_NUMBER, RANK, LAG, and LEAD perform calculations across sets of rows related to the current row.
- PIVOT and UNPIVOT transform data between wide and long formats for analytics and reporting.
- Subqueries can be used in SELECT, FROM, WHERE, and HAVING clauses for flexible data retrieval.
Spark SQL Query Fundamentals
Quick Answer: The exam tests practical SQL skills including CTEs, window functions (ROW_NUMBER, RANK, LAG, LEAD), PIVOT/UNPIVOT, and subqueries. Focus on understanding when and how to use each technique for data transformation scenarios.
Common Table Expressions (CTEs)
CTEs create named temporary result sets that simplify complex queries:
-- CTE with the WITH clause
WITH daily_sales AS (
SELECT
order_date,
SUM(amount) AS total_sales,
COUNT(*) AS order_count
FROM orders
GROUP BY order_date
),
ranked_days AS (
SELECT
*,
RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
FROM daily_sales
)
SELECT * FROM ranked_days WHERE sales_rank <= 10;
Multiple CTEs
- You can chain multiple CTEs separated by commas
- Each CTE can reference previously defined CTEs
- CTEs only exist for the duration of the query
- Use CTEs to improve readability of complex transformations
Window Functions
Window functions perform calculations across a set of rows related to the current row without collapsing them into a single output row.
Syntax Pattern
function_name() OVER (
PARTITION BY column_name
ORDER BY column_name
[ROWS/RANGE frame_specification]
)
Ranking Functions
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank
FROM orders;
| Function | Behavior | Ties |
|---|---|---|
| ROW_NUMBER() | Unique sequential number per partition | No ties — arbitrary order |
| RANK() | Ranking with gaps for ties | 1, 2, 2, 4 |
| DENSE_RANK() | Ranking without gaps for ties | 1, 2, 2, 3 |
Analytic Functions
SELECT
order_date,
daily_sales,
LAG(daily_sales, 1) OVER (ORDER BY order_date) AS prev_day_sales,
LEAD(daily_sales, 1) OVER (ORDER BY order_date) AS next_day_sales,
SUM(daily_sales) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(daily_sales) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM daily_summary;
| Function | Purpose |
|---|---|
| LAG(col, n) | Access value from n rows before the current row |
| LEAD(col, n) | Access value from n rows after the current row |
| SUM() OVER (...) | Running sum or window-based aggregation |
| AVG() OVER (...) | Moving average or window-based average |
| FIRST_VALUE(col) | First value in the window frame |
| LAST_VALUE(col) | Last value in the window frame |
PIVOT and UNPIVOT
PIVOT (Long to Wide)
-- Transform rows into columns
SELECT * FROM (
SELECT month, product_category, revenue
FROM monthly_sales
)
PIVOT (
SUM(revenue)
FOR product_category IN ('Electronics', 'Clothing', 'Food')
);
-- Result: month | Electronics | Clothing | Food
UNPIVOT (Wide to Long)
-- Transform columns into rows
SELECT * FROM wide_sales
UNPIVOT (
revenue FOR product_category IN (Electronics, Clothing, Food)
);
-- Result: month | product_category | revenue
Subqueries
-- Scalar subquery in SELECT
SELECT
customer_id,
total_spend,
(SELECT AVG(total_spend) FROM customers) AS avg_spend
FROM customers;
-- Subquery in WHERE
SELECT * FROM orders
WHERE customer_id IN (
SELECT customer_id FROM customers WHERE segment = 'Premium'
);
-- Correlated subquery
SELECT * FROM orders o
WHERE amount > (
SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id
);
-- Subquery in FROM
SELECT segment, avg_order
FROM (
SELECT customer_segment AS segment, AVG(amount) AS avg_order
FROM orders GROUP BY customer_segment
) WHERE avg_order > 100;
On the Exam: Window functions are heavily tested. Be comfortable with ROW_NUMBER for deduplication, LAG/LEAD for comparing consecutive rows, and running totals/averages with SUM/AVG OVER().
A data engineer needs to find the most recent order for each customer. Which approach is most appropriate?
What is the difference between RANK() and DENSE_RANK() when there are tied values?
Which SQL operation transforms rows into columns (long format to wide format)?