3.3 Advanced SQL Transformations
Key Takeaways
- Join types (INNER, LEFT, RIGHT, FULL OUTER, CROSS, plus SEMI and ANTI) combine and filter silver-layer tables in different ways.
- Window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER) compute ranked and running results without collapsing rows like GROUP BY does.
- PIVOT rotates row values into columns for crosstab reporting, while explode/posexplode flatten arrays into rows.
- Complex types (STRUCT, ARRAY, MAP) are accessed with dot, index, and explode operations to work with nested JSON.
- Conditional and null-handling functions (CASE WHEN, COALESCE, NULLIF) encode business logic and protect against NULL surprises.
Join Types
Joins combine rows from two tables on a condition and are central to building silver and gold tables. Know precisely what each returns:
| Join | Returns |
|---|---|
| INNER | Only rows matching in both tables |
| LEFT (OUTER) | All left rows; NULLs where right has no match |
| RIGHT (OUTER) | All right rows; NULLs where left has no match |
| FULL OUTER | All rows from both; NULLs on the non-matching side |
| CROSS | Cartesian product (every left row × every right row) |
| LEFT SEMI | Left rows that have a match (no right columns added) |
| LEFT ANTI | Left rows that have NO match in the right table |
SELECT o.id, c.name
FROM orders o
LEFT JOIN customers c ON o.cust_id = c.id;
A LEFT ANTI JOIN is the idiomatic way to find records present in one table but missing in another (for example, orders with no matching customer) — a frequent exam scenario where candidates wrongly pick a LEFT JOIN with an IS NULL filter as the "only" option.
Window Functions
Window functions compute a value across a set of rows related to the current row, but unlike GROUP BY they do not collapse rows. They use OVER (PARTITION BY ... ORDER BY ...).
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn,
SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
FROM events;
- ROW_NUMBER() assigns a unique sequential number per partition — the backbone of deduplication (keep
rn = 1). - RANK / DENSE_RANK rank with ties; RANK skips numbers after ties, DENSE_RANK does not.
- LAG / LEAD access a previous or next row's value (period-over-period comparisons).
- SUM/AVG OVER produce running or moving aggregates.
Reshaping: PIVOT and EXPLODE
PIVOT rotates distinct row values into columns to build crosstabs:
SELECT * FROM sales
PIVOT (SUM(amount) FOR quarter IN ('Q1','Q2','Q3','Q4'));
EXPLODE does the opposite for arrays, generating one output row per array element; posexplode also returns the element's position. These flatten nested JSON arrays into a tabular silver shape.
Complex Types and Conditional Logic
Databricks SQL natively supports nested complex types common in JSON sources:
- STRUCT — access fields with dot notation:
address.city. - ARRAY — index with brackets (
tags[0]) or flatten withexplode. - MAP — look up by key:
attributes['color'].
These let you query semi-structured bronze data without first fully flattening it, then progressively flatten into silver.
Conditional and Null Functions
Business logic and data cleansing rely heavily on conditional and null-handling expressions:
- CASE WHEN ... THEN ... ELSE ... END — multi-branch logic (e.g., bucketing amounts into tiers).
- COALESCE(a, b, c) — returns the first non-NULL argument; the standard way to supply defaults.
- NULLIF(a, b) — returns NULL when
a = b, useful to avoid divide-by-zero. - IF(cond, t, f) and NVL(a, b) — shorthand conditionals.
Set operators round out the toolkit: UNION (distinct) and UNION ALL (keeps duplicates) stack result sets, INTERSECT returns common rows, and EXCEPT (MINUS) returns rows in the first query not in the second.
Higher-Order Functions and Aggregation Nuances
For arrays you do not always want to explode-then-regroup. Higher-order functions operate on array elements in place:
- transform(arr, x -> x * 2) applies an expression to each element.
- filter(arr, x -> x > 0) keeps elements matching a predicate.
- aggregate(arr, 0, (acc, x) -> acc + x) reduces an array to a single value.
- array_contains, size, element_at inspect arrays directly.
These keep transformations vectorized and avoid the row-explosion cost of explode.
Aggregation traps
- COUNT(*) vs COUNT(col):
COUNT(*)counts all rows including NULLs, whileCOUNT(col)ignores NULLs in that column — a classic distractor. - HAVING vs WHERE:
WHEREfilters rows before grouping;HAVINGfilters after aggregation (e.g.,HAVING SUM(amount) > 1000). - GROUP BY with multiple grouping sets:
GROUPING SETS,ROLLUP, andCUBEproduce subtotals and grand totals in one pass, useful for gold reporting tables. - collect_list vs collect_set: both gather values into an array per group, but
collect_setremoves duplicates.
These functions and distinctions show up constantly in scenario questions where you must pick the SQL that produces a specific shaped result for a silver or gold table.
Parsing Semi-Structured Data
Bronze data often arrives as JSON strings or with nested payloads. Databricks SQL gives you functions to parse and flatten these into silver:
- from_json(col, schema) parses a JSON string column into a STRUCT using a supplied schema.
- schema_of_json(sample) infers a schema from a sample string.
- to_json(struct) serializes a STRUCT back to a JSON string.
- The colon operator (
payload:user.id) extracts a value directly from a JSON string column without first parsing it into a STRUCT — a concise shortcut tested on the exam.
SELECT raw:event_type::string AS event_type,
from_json(raw, 'struct<id:int,city:string>') AS parsed
FROM bronze_events;
The :: is a cast shorthand (raw:amount::double). Combined with explode for arrays and dot access for structs, these let you progressively flatten a bronze JSON blob into a clean, typed silver table.
Date and string essentials
Time and text cleansing dominate silver transformations: to_date / to_timestamp parse strings into temporal types, date_trunc('month', ts) rolls timestamps to a boundary, datediff and date_add do arithmetic, and regexp_extract / regexp_replace clean messy text with patterns. Knowing that to_date returns NULL on an unparseable string (rather than erroring) is a subtle point that affects how many rows survive a silver load.
You need every order even when no matching customer record exists, with customer columns filled as NULL for unmatched orders. Which join do you use?
Which function returns the FIRST non-NULL value from its list of arguments?
How do window functions differ from GROUP BY aggregations?
Which set operator returns rows that exist in the first query's result but NOT in the second's?