3.3 Advanced SQL Transformations

Key Takeaways

  • Join types (INNER, LEFT, RIGHT, FULL OUTER, CROSS, plus SEMI and ANTI) combine and filter silver-layer tables in different ways.
  • Window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER) compute ranked and running results without collapsing rows like GROUP BY does.
  • PIVOT rotates row values into columns for crosstab reporting, while explode/posexplode flatten arrays into rows.
  • Complex types (STRUCT, ARRAY, MAP) are accessed with dot, index, and explode operations to work with nested JSON.
  • Conditional and null-handling functions (CASE WHEN, COALESCE, NULLIF) encode business logic and protect against NULL surprises.
Last updated: June 2026

Join Types

Joins combine rows from two tables on a condition and are central to building silver and gold tables. Know precisely what each returns:

JoinReturns
INNEROnly rows matching in both tables
LEFT (OUTER)All left rows; NULLs where right has no match
RIGHT (OUTER)All right rows; NULLs where left has no match
FULL OUTERAll rows from both; NULLs on the non-matching side
CROSSCartesian product (every left row × every right row)
LEFT SEMILeft rows that have a match (no right columns added)
LEFT ANTILeft rows that have NO match in the right table
SELECT o.id, c.name
FROM orders o
LEFT JOIN customers c ON o.cust_id = c.id;

A LEFT ANTI JOIN is the idiomatic way to find records present in one table but missing in another (for example, orders with no matching customer) — a frequent exam scenario where candidates wrongly pick a LEFT JOIN with an IS NULL filter as the "only" option.

Window Functions

Window functions compute a value across a set of rows related to the current row, but unlike GROUP BY they do not collapse rows. They use OVER (PARTITION BY ... ORDER BY ...).

SELECT *,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
FROM events;
  • ROW_NUMBER() assigns a unique sequential number per partition — the backbone of deduplication (keep rn = 1).
  • RANK / DENSE_RANK rank with ties; RANK skips numbers after ties, DENSE_RANK does not.
  • LAG / LEAD access a previous or next row's value (period-over-period comparisons).
  • SUM/AVG OVER produce running or moving aggregates.

Reshaping: PIVOT and EXPLODE

PIVOT rotates distinct row values into columns to build crosstabs:

SELECT * FROM sales
PIVOT (SUM(amount) FOR quarter IN ('Q1','Q2','Q3','Q4'));

EXPLODE does the opposite for arrays, generating one output row per array element; posexplode also returns the element's position. These flatten nested JSON arrays into a tabular silver shape.

Complex Types and Conditional Logic

Databricks SQL natively supports nested complex types common in JSON sources:

  • STRUCT — access fields with dot notation: address.city.
  • ARRAY — index with brackets (tags[0]) or flatten with explode.
  • MAP — look up by key: attributes['color'].

These let you query semi-structured bronze data without first fully flattening it, then progressively flatten into silver.

Conditional and Null Functions

Business logic and data cleansing rely heavily on conditional and null-handling expressions:

  • CASE WHEN ... THEN ... ELSE ... END — multi-branch logic (e.g., bucketing amounts into tiers).
  • COALESCE(a, b, c) — returns the first non-NULL argument; the standard way to supply defaults.
  • NULLIF(a, b) — returns NULL when a = b, useful to avoid divide-by-zero.
  • IF(cond, t, f) and NVL(a, b) — shorthand conditionals.

Set operators round out the toolkit: UNION (distinct) and UNION ALL (keeps duplicates) stack result sets, INTERSECT returns common rows, and EXCEPT (MINUS) returns rows in the first query not in the second.

Higher-Order Functions and Aggregation Nuances

For arrays you do not always want to explode-then-regroup. Higher-order functions operate on array elements in place:

  • transform(arr, x -> x * 2) applies an expression to each element.
  • filter(arr, x -> x > 0) keeps elements matching a predicate.
  • aggregate(arr, 0, (acc, x) -> acc + x) reduces an array to a single value.
  • array_contains, size, element_at inspect arrays directly.

These keep transformations vectorized and avoid the row-explosion cost of explode.

Aggregation traps

  • COUNT(*) vs COUNT(col): COUNT(*) counts all rows including NULLs, while COUNT(col) ignores NULLs in that column — a classic distractor.
  • HAVING vs WHERE: WHERE filters rows before grouping; HAVING filters after aggregation (e.g., HAVING SUM(amount) > 1000).
  • GROUP BY with multiple grouping sets: GROUPING SETS, ROLLUP, and CUBE produce subtotals and grand totals in one pass, useful for gold reporting tables.
  • collect_list vs collect_set: both gather values into an array per group, but collect_set removes duplicates.

These functions and distinctions show up constantly in scenario questions where you must pick the SQL that produces a specific shaped result for a silver or gold table.

Parsing Semi-Structured Data

Bronze data often arrives as JSON strings or with nested payloads. Databricks SQL gives you functions to parse and flatten these into silver:

  • from_json(col, schema) parses a JSON string column into a STRUCT using a supplied schema.
  • schema_of_json(sample) infers a schema from a sample string.
  • to_json(struct) serializes a STRUCT back to a JSON string.
  • The colon operator (payload:user.id) extracts a value directly from a JSON string column without first parsing it into a STRUCT — a concise shortcut tested on the exam.
SELECT raw:event_type::string AS event_type,
       from_json(raw, 'struct<id:int,city:string>') AS parsed
FROM bronze_events;

The :: is a cast shorthand (raw:amount::double). Combined with explode for arrays and dot access for structs, these let you progressively flatten a bronze JSON blob into a clean, typed silver table.

Date and string essentials

Time and text cleansing dominate silver transformations: to_date / to_timestamp parse strings into temporal types, date_trunc('month', ts) rolls timestamps to a boundary, datediff and date_add do arithmetic, and regexp_extract / regexp_replace clean messy text with patterns. Knowing that to_date returns NULL on an unparseable string (rather than erroring) is a subtle point that affects how many rows survive a silver load.

Test Your Knowledge

You need every order even when no matching customer record exists, with customer columns filled as NULL for unmatched orders. Which join do you use?

A
B
C
D
Test Your Knowledge

Which function returns the FIRST non-NULL value from its list of arguments?

A
B
C
D
Test Your Knowledge

How do window functions differ from GROUP BY aggregations?

A
B
C
D
Test Your Knowledge

Which set operator returns rows that exist in the first query's result but NOT in the second's?

A
B
C
D