2.4 Spark SQL Query Fundamentals

Key Takeaways

  • Spark SQL supports standard SQL operations including SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and JOIN.
  • Common table expressions (CTEs) with the WITH clause simplify complex queries by breaking them into named subqueries.
  • Window functions like ROW_NUMBER, RANK, LAG, and LEAD perform calculations across sets of rows related to the current row.
  • PIVOT and UNPIVOT transform data between wide and long formats for analytics and reporting.
  • Subqueries can be used in SELECT, FROM, WHERE, and HAVING clauses for flexible data retrieval.
Last updated: March 2026

Spark SQL Query Fundamentals

Quick Answer: The exam tests practical SQL skills including CTEs, window functions (ROW_NUMBER, RANK, LAG, LEAD), PIVOT/UNPIVOT, and subqueries. Focus on understanding when and how to use each technique for data transformation scenarios.

Common Table Expressions (CTEs)

CTEs create named temporary result sets that simplify complex queries:

-- CTE with the WITH clause
WITH daily_sales AS (
    SELECT
        order_date,
        SUM(amount) AS total_sales,
        COUNT(*) AS order_count
    FROM orders
    GROUP BY order_date
),
ranked_days AS (
    SELECT
        *,
        RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
    FROM daily_sales
)
SELECT * FROM ranked_days WHERE sales_rank <= 10;

Multiple CTEs

  • You can chain multiple CTEs separated by commas
  • Each CTE can reference previously defined CTEs
  • CTEs only exist for the duration of the query
  • Use CTEs to improve readability of complex transformations

Window Functions

Window functions perform calculations across a set of rows related to the current row without collapsing them into a single output row.

Syntax Pattern

function_name() OVER (
    PARTITION BY column_name
    ORDER BY column_name
    [ROWS/RANGE frame_specification]
)

Ranking Functions

SELECT
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num,
    RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank
FROM orders;
FunctionBehaviorTies
ROW_NUMBER()Unique sequential number per partitionNo ties — arbitrary order
RANK()Ranking with gaps for ties1, 2, 2, 4
DENSE_RANK()Ranking without gaps for ties1, 2, 2, 3

Analytic Functions

SELECT
    order_date,
    daily_sales,
    LAG(daily_sales, 1) OVER (ORDER BY order_date) AS prev_day_sales,
    LEAD(daily_sales, 1) OVER (ORDER BY order_date) AS next_day_sales,
    SUM(daily_sales) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    AVG(daily_sales) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM daily_summary;
FunctionPurpose
LAG(col, n)Access value from n rows before the current row
LEAD(col, n)Access value from n rows after the current row
SUM() OVER (...)Running sum or window-based aggregation
AVG() OVER (...)Moving average or window-based average
FIRST_VALUE(col)First value in the window frame
LAST_VALUE(col)Last value in the window frame

PIVOT and UNPIVOT

PIVOT (Long to Wide)

-- Transform rows into columns
SELECT * FROM (
    SELECT month, product_category, revenue
    FROM monthly_sales
)
PIVOT (
    SUM(revenue)
    FOR product_category IN ('Electronics', 'Clothing', 'Food')
);
-- Result: month | Electronics | Clothing | Food

UNPIVOT (Wide to Long)

-- Transform columns into rows
SELECT * FROM wide_sales
UNPIVOT (
    revenue FOR product_category IN (Electronics, Clothing, Food)
);
-- Result: month | product_category | revenue

Subqueries

-- Scalar subquery in SELECT
SELECT
    customer_id,
    total_spend,
    (SELECT AVG(total_spend) FROM customers) AS avg_spend
FROM customers;

-- Subquery in WHERE
SELECT * FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE segment = 'Premium'
);

-- Correlated subquery
SELECT * FROM orders o
WHERE amount > (
    SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id
);

-- Subquery in FROM
SELECT segment, avg_order
FROM (
    SELECT customer_segment AS segment, AVG(amount) AS avg_order
    FROM orders GROUP BY customer_segment
) WHERE avg_order > 100;

On the Exam: Window functions are heavily tested. Be comfortable with ROW_NUMBER for deduplication, LAG/LEAD for comparing consecutive rows, and running totals/averages with SUM/AVG OVER().

Test Your Knowledge

A data engineer needs to find the most recent order for each customer. Which approach is most appropriate?

A
B
C
D
Test Your Knowledge

What is the difference between RANK() and DENSE_RANK() when there are tied values?

A
B
C
D
Test Your Knowledge

Which SQL operation transforms rows into columns (long format to wide format)?

A
B
C
D