3.3 Advanced SQL Transformations
Key Takeaways
- Multi-table JOINs (INNER, LEFT, RIGHT, FULL, CROSS, SEMI, ANTI) are fundamental for combining data across silver-layer tables.
- Date and timestamp functions (date_format, date_add, datediff, current_timestamp, to_date) are critical for time-based transformations.
- String functions (concat, substring, regexp_extract, trim, lower, upper, split, replace) handle text data cleansing.
- Conditional expressions (CASE WHEN, IF, COALESCE, NULLIF, NVL) implement business logic in SQL queries.
- UNION, INTERSECT, and EXCEPT combine or compare result sets from multiple queries.
Advanced SQL Transformations
Quick Answer: Master multi-table JOINs, date/string functions, CASE WHEN conditional logic, and set operations (UNION, INTERSECT, EXCEPT). These are the building blocks for silver-to-gold transformations in the medallion architecture.
Multi-Table JOINs
Join Types
| Join Type | Returns |
|---|---|
| INNER JOIN | Only rows with matches in both tables |
| LEFT JOIN | All rows from left table + matching rows from right |
| RIGHT JOIN | All rows from right table + matching rows from left |
| FULL OUTER JOIN | All rows from both tables (nulls where no match) |
| CROSS JOIN | Cartesian product — every row paired with every row |
| LEFT SEMI JOIN | Rows from left table that have a match in right (no right columns) |
| LEFT ANTI JOIN | Rows from left table that have NO match in right |
-- Multi-table join for gold layer aggregation
SELECT
o.order_date,
c.customer_name,
c.segment,
p.product_name,
p.category,
o.quantity,
o.total_amount
FROM silver.orders o
INNER JOIN silver.customers c ON o.customer_id = c.customer_id
INNER JOIN silver.products p ON o.product_id = p.product_id
WHERE o.order_date >= '2026-01-01';
SEMI and ANTI Joins
-- LEFT SEMI JOIN: Customers who have placed orders
SELECT * FROM customers c
LEFT SEMI JOIN orders o ON c.customer_id = o.customer_id;
-- LEFT ANTI JOIN: Customers who have NEVER placed orders
SELECT * FROM customers c
LEFT ANTI JOIN orders o ON c.customer_id = o.customer_id;
On the Exam: LEFT SEMI JOIN is like EXISTS — it returns rows from the left table that have at least one match, without duplicating rows. LEFT ANTI JOIN is like NOT EXISTS — it returns rows with no match.
Date and Timestamp Functions
SELECT
order_date,
-- Formatting
date_format(order_date, 'yyyy-MM-dd') AS formatted_date,
date_format(order_date, 'EEEE') AS day_of_week,
-- Arithmetic
date_add(order_date, 30) AS plus_30_days,
date_sub(order_date, 7) AS minus_7_days,
datediff(current_date(), order_date) AS days_since_order,
months_between(current_date(), order_date) AS months_since_order,
-- Extraction
year(order_date) AS order_year,
month(order_date) AS order_month,
dayofweek(order_date) AS day_num,
-- Current values
current_date() AS today,
current_timestamp() AS now,
-- Conversion
to_date('2026-03-31', 'yyyy-MM-dd') AS parsed_date,
to_timestamp('2026-03-31 14:30:00', 'yyyy-MM-dd HH:mm:ss') AS parsed_ts
FROM orders;
String Functions
SELECT
-- Case conversion
lower(customer_name) AS lower_name,
upper(customer_name) AS upper_name,
initcap(customer_name) AS title_case,
-- Trimming
trim(raw_input) AS trimmed,
ltrim(raw_input) AS left_trimmed,
rtrim(raw_input) AS right_trimmed,
-- Extraction and manipulation
substring(phone, 1, 3) AS area_code,
concat(first_name, ' ', last_name) AS full_name,
replace(phone, '-', '') AS clean_phone,
split(address, ',') AS address_parts,
-- Pattern matching
regexp_extract(email, '(.+)@(.+)', 1) AS email_username,
regexp_extract(email, '(.+)@(.+)', 2) AS email_domain,
-- Measurement
length(description) AS desc_length,
-- Null handling
COALESCE(middle_name, '') AS middle_name_safe
FROM customers;
Conditional Expressions
CASE WHEN
SELECT
order_id,
total_amount,
CASE
WHEN total_amount >= 1000 THEN 'Premium'
WHEN total_amount >= 100 THEN 'Standard'
ELSE 'Basic'
END AS order_tier
FROM orders;
IF Function
SELECT
customer_id,
IF(total_orders > 10, 'Frequent', 'Occasional') AS customer_type
FROM customer_summary;
Null-Handling Functions
| Function | Description | Example |
|---|---|---|
| COALESCE(a, b, c) | Returns the first non-null value | COALESCE(phone, email, 'N/A') |
| NVL(a, b) | Returns b if a is null | NVL(middle_name, '') |
| NULLIF(a, b) | Returns null if a equals b | NULLIF(status, 'unknown') |
| IFNULL(a, b) | Same as NVL | IFNULL(discount, 0) |
Set Operations
-- UNION ALL: combine results (keep duplicates)
SELECT customer_id FROM online_orders
UNION ALL
SELECT customer_id FROM store_orders;
-- UNION: combine results (remove duplicates)
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;
-- INTERSECT: rows in both queries
SELECT customer_id FROM online_orders
INTERSECT
SELECT customer_id FROM store_orders;
-- EXCEPT: rows in first query but not second
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM churned_customers;
On the Exam: UNION ALL is preferred over UNION when duplicates are acceptable because UNION requires an expensive deduplication step. Use EXCEPT to find rows that exist in one dataset but not another.
A data engineer wants to find all customers who have NEVER placed an order. Which join type should they use?
What does the COALESCE function return when given the arguments COALESCE(NULL, NULL, 'default', 'backup')?
Which set operation combines two result sets while removing duplicate rows?