3.3 Advanced SQL Transformations

Key Takeaways

Join types (INNER, LEFT, RIGHT, FULL OUTER, CROSS, plus SEMI and ANTI) combine and filter silver-layer tables in different ways.
Window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER) compute ranked and running results without collapsing rows like GROUP BY does.
PIVOT rotates row values into columns for crosstab reporting, while explode/posexplode flatten arrays into rows.
Complex types (STRUCT, ARRAY, MAP) are accessed with dot, index, and explode operations to work with nested JSON.
Conditional and null-handling functions (CASE WHEN, COALESCE, NULLIF) encode business logic and protect against NULL surprises.

Last updated: June 2026

Join Types

Joins combine rows from two tables on a condition and are central to building silver and gold tables. Know precisely what each returns:

Join	Returns
INNER	Only rows matching in both tables
LEFT (OUTER)	All left rows; NULLs where right has no match
RIGHT (OUTER)	All right rows; NULLs where left has no match
FULL OUTER	All rows from both; NULLs on the non-matching side
CROSS	Cartesian product (every left row × every right row)
LEFT SEMI	Left rows that have a match (no right columns added)
LEFT ANTI	Left rows that have NO match in the right table

SELECT o.id, c.name
FROM orders o
LEFT JOIN customers c ON o.cust_id = c.id;

A LEFT ANTI JOIN is the idiomatic way to find records present in one table but missing in another (for example, orders with no matching customer) — a frequent exam scenario where candidates wrongly pick a LEFT JOIN with an IS NULL filter as the "only" option.

Window Functions

Window functions compute a value across a set of rows related to the current row, but unlike GROUP BY they do not collapse rows. They use OVER (PARTITION BY ... ORDER BY ...).

SELECT *,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
FROM events;

ROW_NUMBER() assigns a unique sequential number per partition — the backbone of deduplication (keep rn = 1).
RANK / DENSE_RANK rank with ties; RANK skips numbers after ties, DENSE_RANK does not.
LAG / LEAD access a previous or next row's value (period-over-period comparisons).
SUM/AVG OVER produce running or moving aggregates.

Reshaping: PIVOT and EXPLODE

PIVOT rotates distinct row values into columns to build crosstabs:

SELECT * FROM sales
PIVOT (SUM(amount) FOR quarter IN ('Q1','Q2','Q3','Q4'));

EXPLODE does the opposite for arrays, generating one output row per array element; posexplode also returns the element's position. These flatten nested JSON arrays into a tabular silver shape.

Complex Types and Conditional Logic

Databricks SQL natively supports nested complex types common in JSON sources:

STRUCT — access fields with dot notation: address.city.
ARRAY — index with brackets (tags[0]) or flatten with explode.
MAP — look up by key: attributes['color'].

These let you query semi-structured bronze data without first fully flattening it, then progressively flatten into silver.

Conditional and Null Functions

Business logic and data cleansing rely heavily on conditional and null-handling expressions:

CASE WHEN ... THEN ... ELSE ... END — multi-branch logic (e.g., bucketing amounts into tiers).
COALESCE(a, b, c) — returns the first non-NULL argument; the standard way to supply defaults.
NULLIF(a, b) — returns NULL when a = b, useful to avoid divide-by-zero.
IF(cond, t, f) and NVL(a, b) — shorthand conditionals.

Set operators round out the toolkit: UNION (distinct) and UNION ALL (keeps duplicates) stack result sets, INTERSECT returns common rows, and EXCEPT (MINUS) returns rows in the first query not in the second.

Higher-Order Functions and Aggregation Nuances

For arrays you do not always want to explode-then-regroup. Higher-order functions operate on array elements in place:

transform(arr, x -> x * 2) applies an expression to each element.
filter(arr, x -> x > 0) keeps elements matching a predicate.
aggregate(arr, 0, (acc, x) -> acc + x) reduces an array to a single value.
array_contains, size, element_at inspect arrays directly.

These keep transformations vectorized and avoid the row-explosion cost of explode.

Aggregation traps

COUNT(*) vs COUNT(col): COUNT(*) counts all rows including NULLs, while COUNT(col) ignores NULLs in that column — a classic distractor.
HAVING vs WHERE: WHERE filters rows before grouping; HAVING filters after aggregation (e.g., HAVING SUM(amount) > 1000).
GROUP BY with multiple grouping sets: GROUPING SETS, ROLLUP, and CUBE produce subtotals and grand totals in one pass, useful for gold reporting tables.
collect_list vs collect_set: both gather values into an array per group, but collect_set removes duplicates.

These functions and distinctions show up constantly in scenario questions where you must pick the SQL that produces a specific shaped result for a silver or gold table.

Parsing Semi-Structured Data

Bronze data often arrives as JSON strings or with nested payloads. Databricks SQL gives you functions to parse and flatten these into silver:

from_json(col, schema) parses a JSON string column into a STRUCT using a supplied schema.
schema_of_json(sample) infers a schema from a sample string.
to_json(struct) serializes a STRUCT back to a JSON string.
The colon operator (payload:user.id) extracts a value directly from a JSON string column without first parsing it into a STRUCT — a concise shortcut tested on the exam.

SELECT raw:event_type::string AS event_type,
       from_json(raw, 'struct<id:int,city:string>') AS parsed
FROM bronze_events;

The :: is a cast shorthand (raw:amount::double). Combined with explode for arrays and dot access for structs, these let you progressively flatten a bronze JSON blob into a clean, typed silver table.

Date and string essentials

Time and text cleansing dominate silver transformations: to_date / to_timestamp parse strings into temporal types, date_trunc('month', ts) rolls timestamps to a boundary, datediff and date_add do arithmetic, and regexp_extract / regexp_replace clean messy text with patterns. Knowing that to_date returns NULL on an unparseable string (rather than erroring) is a subtle point that affects how many rows survive a silver load.

Test Your Knowledge

You need every order even when no matching customer record exists, with customer columns filled as NULL for unmatched orders. Which join do you use?

INNER JOIN

LEFT (OUTER) JOIN with orders on the left

CROSS JOIN

LEFT ANTI JOIN

Test Your Knowledge

Which function returns the FIRST non-NULL value from its list of arguments?

NULLIF

ROW_NUMBER

COALESCE

EXPLODE

Test Your Knowledge

How do window functions differ from GROUP BY aggregations?

Window functions collapse each group into a single row, just like GROUP BY

Window functions can only be used on streaming tables

Window functions compute results across related rows while preserving the individual rows

Window functions cannot use PARTITION BY

Test Your Knowledge

Which set operator returns rows that exist in the first query's result but NOT in the second's?

UNION ALL, which stacks both results keeping duplicates

INTERSECT, which returns rows common to both

UNION, which returns the distinct combination of both

EXCEPT, which subtracts the second result from the first

Up Next

3.4 Incremental Data Processing Patterns

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

3.3 Advanced SQL Transformations

Key Takeaways

Join Types

Window Functions

Reshaping: PIVOT and EXPLODE

Complex Types and Conditional Logic

Conditional and Null Functions

Higher-Order Functions and Aggregation Nuances

Aggregation traps

Parsing Semi-Structured Data

Date and string essentials

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

3.3 Advanced SQL Transformations

Key Takeaways

Join Types

Window Functions

Reshaping: PIVOT and EXPLODE

Complex Types and Conditional Logic

Conditional and Null Functions

Higher-Order Functions and Aggregation Nuances

Aggregation traps

Parsing Semi-Structured Data

Date and string essentials