2.5 Complex Data Types and Higher-Order Functions
Key Takeaways
- Spark SQL supports complex data types including arrays, maps, and structs for handling nested and semi-structured data.
- Higher-order functions (TRANSFORM, FILTER, EXISTS, REDUCE, AGGREGATE) operate on arrays without exploding them.
- The explode() function converts array elements into separate rows, while collect_list()/collect_set() aggregate rows back into arrays.
- The colon notation (column:nested_field) and dot notation (column.nested_field) access nested fields in structs and JSON data.
- The from_json() and to_json() functions convert between JSON strings and Spark struct types for semi-structured data processing.
Complex Data Types and Higher-Order Functions
Quick Answer: Spark handles nested data with arrays, maps, and structs. Higher-order functions (TRANSFORM, FILTER, EXISTS) process arrays inline without exploding. Use colon notation for JSON access and explode/collect for row-level transformations.
Complex Data Types
Arrays
An ordered collection of elements of the same type:
-- Create an array
SELECT array(1, 2, 3, 4, 5) AS numbers;
-- Access array elements (0-indexed)
SELECT items[0] AS first_item FROM orders;
-- Array functions
SELECT
size(items) AS item_count, -- Number of elements
array_contains(items, 'laptop') AS has_laptop, -- Check membership
array_distinct(items) AS unique_items, -- Remove duplicates
flatten(nested_arrays) AS flat_array, -- Flatten nested arrays
array_union(arr1, arr2) AS combined -- Union of two arrays
FROM orders;
Maps
A collection of key-value pairs:
-- Create a map
SELECT map('name', 'Alice', 'age', '30') AS user_info;
-- Access map values
SELECT user_info['name'] AS name FROM users;
-- Map functions
SELECT
map_keys(properties) AS all_keys,
map_values(properties) AS all_values,
map_from_arrays(key_array, value_array) AS new_map
FROM events;
Structs
A named collection of fields (like a row within a row):
-- Create a struct
SELECT struct(1 AS id, 'Alice' AS name, 30 AS age) AS person;
-- Access struct fields
SELECT address.city, address.state, address.zip
FROM customers;
Accessing Nested Data
JSON Colon Notation (for variant/string JSON columns)
-- Access nested JSON fields using colon notation
SELECT
raw_data:customer:name AS customer_name,
raw_data:customer:email AS customer_email,
raw_data:items[0]:product AS first_product
FROM raw_events;
Dot Notation (for struct columns)
-- Access struct fields using dot notation
SELECT
customer.name AS customer_name,
customer.address.city AS city
FROM parsed_events;
Higher-Order Functions
Higher-order functions process array elements without exploding them into separate rows:
TRANSFORM
Apply a function to every element in an array:
-- Double every element in an array
SELECT TRANSFORM(numbers, x -> x * 2) AS doubled
FROM data;
-- Convert array of strings to uppercase
SELECT TRANSFORM(names, x -> upper(x)) AS upper_names
FROM data;
-- Calculate tax for each item price
SELECT TRANSFORM(prices, p -> p * 1.08) AS prices_with_tax
FROM orders;
FILTER
Keep only elements that match a condition:
-- Keep only positive numbers
SELECT FILTER(numbers, x -> x > 0) AS positive_numbers
FROM data;
-- Keep items over \$100
SELECT FILTER(items, i -> i.price > 100) AS expensive_items
FROM orders;
EXISTS
Check if any element satisfies a condition (returns boolean):
-- Check if any item costs more than \$500
SELECT EXISTS(items, i -> i.price > 500) AS has_expensive_item
FROM orders;
REDUCE / AGGREGATE
Combine all elements into a single value:
-- Sum all elements in an array
SELECT REDUCE(numbers, 0, (acc, x) -> acc + x) AS total
FROM data;
-- Concatenate array elements
SELECT AGGREGATE(names, '', (acc, x) -> concat(acc, ', ', x)) AS all_names
FROM data;
Explode and Collect
explode() — Arrays to Rows
-- Each array element becomes a separate row
SELECT order_id, explode(items) AS item
FROM orders;
-- With position
SELECT order_id, posexplode(items) AS (position, item)
FROM orders;
collect_list() / collect_set() — Rows to Arrays
-- Aggregate rows back into an array
SELECT
customer_id,
collect_list(product_name) AS all_products, -- Allows duplicates
collect_set(product_category) AS unique_categories -- No duplicates
FROM orders
GROUP BY customer_id;
Semi-Structured Data Functions
-- Parse JSON string into struct
SELECT from_json(json_string, 'name STRING, age INT') AS parsed
FROM raw_data;
-- Convert struct to JSON string
SELECT to_json(struct_column) AS json_string
FROM parsed_data;
-- Parse JSON with schema_of_json
SELECT from_json(json_col, schema_of_json('{"name":"str","age":1}')) AS parsed
FROM raw_data;
-- Flatten complex nested structures
SELECT inline(array_of_structs) FROM data;
On the Exam: Higher-order functions are preferred over explode + group by + collect for performance. Know the difference between TRANSFORM (changes elements), FILTER (selects elements), and EXISTS (boolean check).
A data engineer has an array column "prices" and needs to keep only values greater than 100 without exploding the array. Which function should they use?
What is the difference between collect_list() and collect_set() in Spark SQL?
Which notation is used to access nested fields in a JSON string column in Databricks SQL?