2.5 Complex Data Types and Higher-Order Functions
Key Takeaways
- Spark complex types are ARRAY (ordered list), STRUCT (named nested fields accessed with dot notation), and MAP (key-value pairs).
- explode() turns one array element per row; posexplode() adds the position index; both increase row count.
- Higher-order functions transform, filter, exists, and aggregate apply a lambda to array elements inline without exploding and re-collecting.
- Dot notation (col.field) reads struct fields, brackets (col[key] or col['k']) read map values or array indices.
- Higher-order functions avoid the expensive explode-transform-collect_list shuffle pattern.
The Three Complex Types
Spark SQL supports three nested ('complex') data types that let a single column hold structured data:
| Type | Shape | Access |
|---|---|---|
| ARRAY | Ordered list of same-typed elements | arr[0], element_at(arr, 1) |
| STRUCT | Named fields, like a sub-record | s.field (dot notation) |
| MAP | Key-value pairs | m['key'], m.key |
Nested JSON ingests naturally into these types — an object becomes a STRUCT, a list becomes an ARRAY. Struct fields are read with dot notation (address.city), array elements by index (items[0]), and map values by key (attributes['color']). You can construct them with array(), struct() / named_struct(), and map(), and inspect them with size() (length of array/map), array_contains(), and map_keys() / map_values().
Flattening with explode and posexplode
To turn array elements into individual rows, use explode(): a row with a 3-element array becomes three rows, each carrying one element. posexplode() does the same but also emits the zero-based position:
SELECT id, explode(tags) AS tag FROM events;
SELECT id, posexplode(tags) AS (pos, tag) FROM events;
| Function | Output |
|---|---|
explode(arr) | One row per element |
explode_outer(arr) | Same, but keeps a null row for empty/null arrays |
posexplode(arr) | Element + position index |
inline(arr<struct>) | Explodes an array of structs into columns |
These functions increase the row count, which is exactly what you want when normalizing nested data into a flat table. To go the other way — collapsing rows back into an array — use collect_list() or collect_set() in a GROUP BY.
Higher-Order Functions
Before higher-order functions, transforming array elements meant explode → transform rows → collect_list — verbose and shuffle-heavy. Higher-order functions apply a lambda to each element inline, preserving the row count:
| Function | Purpose | Example |
|---|---|---|
transform(arr, x -> expr) | Map each element | transform(nums, x -> x * 2) |
filter(arr, x -> cond) | Keep matching elements | filter(nums, x -> x > 0) |
exists(arr, x -> cond) | True if any match | exists(nums, x -> x < 0) |
aggregate(arr, start, (acc,x)->...) | Fold to one value | aggregate(nums, 0, (a,x)->a+x) |
SELECT transform(scores, s -> s + 5) AS curved FROM exams;
The lambda variable (x, s) names each element. Because no rows are exploded and re-collected, higher-order functions avoid an extra shuffle and are far more efficient for per-element work on arrays. aggregate (also called reduce) optionally takes a finalizer lambda to post-process the accumulated result.
Parsing JSON Strings and Working with Maps
Raw data often arrives with a JSON payload stored as a string column rather than as native nested types. Spark provides functions to parse and navigate it:
| Function | Purpose |
|---|---|
from_json(col, schema) | Parse a JSON string into a STRUCT/ARRAY using a supplied schema |
to_json(col) | Serialize a complex column back to a JSON string |
schema_of_json(json) | Infer a schema string from a sample JSON value |
get_json_object(col, '$.a.b') | Extract a single value with a JSON path |
json_tuple(col, 'a', 'b') | Extract several top-level keys at once |
Use from_json when you want to fully materialize nested types for downstream querying, and get_json_object for quick one-off extraction without defining a schema.
Map columns deserve special attention: map_keys(m) and map_values(m) return arrays, element_at(m, 'k') (or m['k']) looks up a value, and explode(m) produces two columns — key and value. A common pattern is explode on a map to normalize key-value attributes into rows. For arrays, additional helpers include array_distinct, array_union, array_intersect, array_sort, sort_array, flatten (collapse an array of arrays), and slice.
Combining these utilities with higher-order functions lets you reshape deeply nested semi-structured data entirely in Spark SQL, without dropping to Python UDFs — which is both faster and the approach the exam expects you to recognize as best practice for nested-data manipulation.
Building and Querying Nested Structures
You also need to construct nested types, not just read them. struct(col1, col2) bundles columns into an anonymous struct, named_struct('a', x, 'b', y) gives the fields names, and selecting struct(*) packs a whole row into one column — handy before writing a nested Parquet/Delta layout. array(1,2,3) and map('k1', v1, 'k2', v2) build literal collections.
A few patterns recur on the exam:
- Star expansion of a struct:
SELECT address.* FROM tpromotes every struct field to a top-level column, the inverse of bundling. - Indexing safely:
element_at(arr, 1)is 1-based and returns null out of range, whereasarr[0]is 0-based — mixing them up is a classic mistake. - Counting elements:
size(arr)andcardinality(map)return the number of entries;sizereturns -1 for null by default. - Filtering arrays in a predicate:
WHERE array_contains(tags, 'urgent')keeps rows whose array holds a value, without exploding.
The broad principle the certification reinforces: prefer operating on complex types in place — with dot/bracket access, higher-order functions, and the array/map utility library — over flattening with explode and re-aggregating, because in-place operations preserve row counts and avoid shuffles. Reach for explode only when the downstream consumer genuinely needs one row per element.
Which function flattens an array column so each element becomes its own row, also returning the element's index?
How do you access the city field of a STRUCT column named address in Spark SQL?
Why is transform(nums, x -> x * 2) preferred over exploding, multiplying, and collect_list to double every array element?