3.3 Advanced SQL Transformations

Key Takeaways

  • Multi-table JOINs (INNER, LEFT, RIGHT, FULL, CROSS, SEMI, ANTI) are fundamental for combining data across silver-layer tables.
  • Date and timestamp functions (date_format, date_add, datediff, current_timestamp, to_date) are critical for time-based transformations.
  • String functions (concat, substring, regexp_extract, trim, lower, upper, split, replace) handle text data cleansing.
  • Conditional expressions (CASE WHEN, IF, COALESCE, NULLIF, NVL) implement business logic in SQL queries.
  • UNION, INTERSECT, and EXCEPT combine or compare result sets from multiple queries.
Last updated: March 2026

Advanced SQL Transformations

Quick Answer: Master multi-table JOINs, date/string functions, CASE WHEN conditional logic, and set operations (UNION, INTERSECT, EXCEPT). These are the building blocks for silver-to-gold transformations in the medallion architecture.

Multi-Table JOINs

Join Types

Join TypeReturns
INNER JOINOnly rows with matches in both tables
LEFT JOINAll rows from left table + matching rows from right
RIGHT JOINAll rows from right table + matching rows from left
FULL OUTER JOINAll rows from both tables (nulls where no match)
CROSS JOINCartesian product — every row paired with every row
LEFT SEMI JOINRows from left table that have a match in right (no right columns)
LEFT ANTI JOINRows from left table that have NO match in right
-- Multi-table join for gold layer aggregation
SELECT
    o.order_date,
    c.customer_name,
    c.segment,
    p.product_name,
    p.category,
    o.quantity,
    o.total_amount
FROM silver.orders o
INNER JOIN silver.customers c ON o.customer_id = c.customer_id
INNER JOIN silver.products p ON o.product_id = p.product_id
WHERE o.order_date >= '2026-01-01';

SEMI and ANTI Joins

-- LEFT SEMI JOIN: Customers who have placed orders
SELECT * FROM customers c
LEFT SEMI JOIN orders o ON c.customer_id = o.customer_id;

-- LEFT ANTI JOIN: Customers who have NEVER placed orders
SELECT * FROM customers c
LEFT ANTI JOIN orders o ON c.customer_id = o.customer_id;

On the Exam: LEFT SEMI JOIN is like EXISTS — it returns rows from the left table that have at least one match, without duplicating rows. LEFT ANTI JOIN is like NOT EXISTS — it returns rows with no match.

Date and Timestamp Functions

SELECT
    order_date,

    -- Formatting
    date_format(order_date, 'yyyy-MM-dd') AS formatted_date,
    date_format(order_date, 'EEEE') AS day_of_week,

    -- Arithmetic
    date_add(order_date, 30) AS plus_30_days,
    date_sub(order_date, 7) AS minus_7_days,
    datediff(current_date(), order_date) AS days_since_order,
    months_between(current_date(), order_date) AS months_since_order,

    -- Extraction
    year(order_date) AS order_year,
    month(order_date) AS order_month,
    dayofweek(order_date) AS day_num,

    -- Current values
    current_date() AS today,
    current_timestamp() AS now,

    -- Conversion
    to_date('2026-03-31', 'yyyy-MM-dd') AS parsed_date,
    to_timestamp('2026-03-31 14:30:00', 'yyyy-MM-dd HH:mm:ss') AS parsed_ts
FROM orders;

String Functions

SELECT
    -- Case conversion
    lower(customer_name) AS lower_name,
    upper(customer_name) AS upper_name,
    initcap(customer_name) AS title_case,

    -- Trimming
    trim(raw_input) AS trimmed,
    ltrim(raw_input) AS left_trimmed,
    rtrim(raw_input) AS right_trimmed,

    -- Extraction and manipulation
    substring(phone, 1, 3) AS area_code,
    concat(first_name, ' ', last_name) AS full_name,
    replace(phone, '-', '') AS clean_phone,
    split(address, ',') AS address_parts,

    -- Pattern matching
    regexp_extract(email, '(.+)@(.+)', 1) AS email_username,
    regexp_extract(email, '(.+)@(.+)', 2) AS email_domain,

    -- Measurement
    length(description) AS desc_length,

    -- Null handling
    COALESCE(middle_name, '') AS middle_name_safe
FROM customers;

Conditional Expressions

CASE WHEN

SELECT
    order_id,
    total_amount,
    CASE
        WHEN total_amount >= 1000 THEN 'Premium'
        WHEN total_amount >= 100 THEN 'Standard'
        ELSE 'Basic'
    END AS order_tier
FROM orders;

IF Function

SELECT
    customer_id,
    IF(total_orders > 10, 'Frequent', 'Occasional') AS customer_type
FROM customer_summary;

Null-Handling Functions

FunctionDescriptionExample
COALESCE(a, b, c)Returns the first non-null valueCOALESCE(phone, email, 'N/A')
NVL(a, b)Returns b if a is nullNVL(middle_name, '')
NULLIF(a, b)Returns null if a equals bNULLIF(status, 'unknown')
IFNULL(a, b)Same as NVLIFNULL(discount, 0)

Set Operations

-- UNION ALL: combine results (keep duplicates)
SELECT customer_id FROM online_orders
UNION ALL
SELECT customer_id FROM store_orders;

-- UNION: combine results (remove duplicates)
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;

-- INTERSECT: rows in both queries
SELECT customer_id FROM online_orders
INTERSECT
SELECT customer_id FROM store_orders;

-- EXCEPT: rows in first query but not second
SELECT customer_id FROM all_customers
EXCEPT
SELECT customer_id FROM churned_customers;

On the Exam: UNION ALL is preferred over UNION when duplicates are acceptable because UNION requires an expensive deduplication step. Use EXCEPT to find rows that exist in one dataset but not another.

Test Your Knowledge

A data engineer wants to find all customers who have NEVER placed an order. Which join type should they use?

A
B
C
D
Test Your Knowledge

What does the COALESCE function return when given the arguments COALESCE(NULL, NULL, 'default', 'backup')?

A
B
C
D
Test Your Knowledge

Which set operation combines two result sets while removing duplicate rows?

A
B
C
D