A data engineer needs to find the most recent order for each customer. Which approach is most appropriate?

Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) and filter where row_num = 1. ROW_NUMBER() with PARTITION BY customer_id ORDER BY order_date DESC assigns a sequential number to each order per customer, starting with the most recent. Filtering where row_num = 1 gives the latest order for each customer with all columns preserved.

What is the difference between RANK() and DENSE_RANK() when there are tied values?

RANK() leaves gaps after ties (1, 2, 2, 4); DENSE_RANK() does not leave gaps (1, 2, 2, 3). RANK() leaves gaps in the ranking sequence after ties (e.g., 1, 2, 2, 4) while DENSE_RANK() does not leave gaps (e.g., 1, 2, 2, 3). ROW_NUMBER() assigns unique sequential numbers even for ties.

Which SQL operation transforms rows into columns (long format to wide format)?

PIVOT. PIVOT transforms rows into columns (long to wide format). UNPIVOT does the reverse — it transforms columns into rows (wide to long format). TRANSPOSE and RESHAPE are not standard SQL operations.

Spark SQL Query Fundamentals

Quick Answer: The exam tests practical SQL skills including CTEs, window functions (ROW_NUMBER, RANK, LAG, LEAD), PIVOT/UNPIVOT, and subqueries. Focus on understanding when and how to use each technique for data transformation scenarios.

Common Table Expressions (CTEs)

CTEs create named temporary result sets that simplify complex queries:

-- CTE with the WITH clause
WITH daily_sales AS (
    SELECT
        order_date,
        SUM(amount) AS total_sales,
        COUNT(*) AS order_count
    FROM orders
    GROUP BY order_date
),
ranked_days AS (
    SELECT
        *,
        RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
    FROM daily_sales
)
SELECT * FROM ranked_days WHERE sales_rank <= 10;

Multiple CTEs

You can chain multiple CTEs separated by commas
Each CTE can reference previously defined CTEs
CTEs only exist for the duration of the query
Use CTEs to improve readability of complex transformations

Window Functions

Window functions perform calculations across a set of rows related to the current row without collapsing them into a single output row.

Syntax Pattern

function_name() OVER (
    PARTITION BY column_name
    ORDER BY column_name
    [ROWS/RANGE frame_specification]
)

Ranking Functions

SELECT
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num,
    RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank
FROM orders;

Function	Behavior	Ties
ROW_NUMBER()	Unique sequential number per partition	No ties — arbitrary order
RANK()	Ranking with gaps for ties	1, 2, 2, 4
DENSE_RANK()	Ranking without gaps for ties	1, 2, 2, 3

Analytic Functions

SELECT
    order_date,
    daily_sales,
    LAG(daily_sales, 1) OVER (ORDER BY order_date) AS prev_day_sales,
    LEAD(daily_sales, 1) OVER (ORDER BY order_date) AS next_day_sales,
    SUM(daily_sales) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    AVG(daily_sales) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM daily_summary;

Function	Purpose
LAG(col, n)	Access value from n rows before the current row
LEAD(col, n)	Access value from n rows after the current row
SUM() OVER (...)	Running sum or window-based aggregation
AVG() OVER (...)	Moving average or window-based average
FIRST_VALUE(col)	First value in the window frame
LAST_VALUE(col)	Last value in the window frame

PIVOT and UNPIVOT

PIVOT (Long to Wide)

-- Transform rows into columns
SELECT * FROM (
    SELECT month, product_category, revenue
    FROM monthly_sales
)
PIVOT (
    SUM(revenue)
    FOR product_category IN ('Electronics', 'Clothing', 'Food')
);
-- Result: month | Electronics | Clothing | Food

UNPIVOT (Wide to Long)

-- Transform columns into rows
SELECT * FROM wide_sales
UNPIVOT (
    revenue FOR product_category IN (Electronics, Clothing, Food)
);
-- Result: month | product_category | revenue

Subqueries

-- Scalar subquery in SELECT
SELECT
    customer_id,
    total_spend,
    (SELECT AVG(total_spend) FROM customers) AS avg_spend
FROM customers;

-- Subquery in WHERE
SELECT * FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE segment = 'Premium'
);

-- Correlated subquery
SELECT * FROM orders o
WHERE amount > (
    SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id
);

-- Subquery in FROM
SELECT segment, avg_order
FROM (
    SELECT customer_segment AS segment, AVG(amount) AS avg_order
    FROM orders GROUP BY customer_segment
) WHERE avg_order > 100;

On the Exam: Window functions are heavily tested. Be comfortable with ROW_NUMBER for deduplication, LAG/LEAD for comparing consecutive rows, and running totals/averages with SUM/AVG OVER().

Databricks Certified Data Engineer Associate

2.4 Spark SQL Query Fundamentals

Key Takeaways

Spark SQL Query Fundamentals

Common Table Expressions (CTEs)

Multiple CTEs

Window Functions

Syntax Pattern

Ranking Functions

Analytic Functions

PIVOT and UNPIVOT

PIVOT (Long to Wide)

UNPIVOT (Wide to Long)

Subqueries

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

2.4 Spark SQL Query Fundamentals

Key Takeaways

Spark SQL Query Fundamentals

Common Table Expressions (CTEs)

Multiple CTEs

Window Functions

Syntax Pattern

Ranking Functions

Analytic Functions

PIVOT and UNPIVOT

PIVOT (Long to Wide)

UNPIVOT (Wide to Long)

Subqueries