2.5 Complex Data Types and Higher-Order Functions

Key Takeaways

  • Spark complex types are ARRAY (ordered list), STRUCT (named nested fields accessed with dot notation), and MAP (key-value pairs).
  • explode() turns one array element per row; posexplode() adds the position index; both increase row count.
  • Higher-order functions transform, filter, exists, and aggregate apply a lambda to array elements inline without exploding and re-collecting.
  • Dot notation (col.field) reads struct fields, brackets (col[key] or col['k']) read map values or array indices.
  • Higher-order functions avoid the expensive explode-transform-collect_list shuffle pattern.
Last updated: June 2026

The Three Complex Types

Spark SQL supports three nested ('complex') data types that let a single column hold structured data:

TypeShapeAccess
ARRAYOrdered list of same-typed elementsarr[0], element_at(arr, 1)
STRUCTNamed fields, like a sub-records.field (dot notation)
MAPKey-value pairsm['key'], m.key

Nested JSON ingests naturally into these types — an object becomes a STRUCT, a list becomes an ARRAY. Struct fields are read with dot notation (address.city), array elements by index (items[0]), and map values by key (attributes['color']). You can construct them with array(), struct() / named_struct(), and map(), and inspect them with size() (length of array/map), array_contains(), and map_keys() / map_values().

Flattening with explode and posexplode

To turn array elements into individual rows, use explode(): a row with a 3-element array becomes three rows, each carrying one element. posexplode() does the same but also emits the zero-based position:

SELECT id, explode(tags) AS tag FROM events;
SELECT id, posexplode(tags) AS (pos, tag) FROM events;
FunctionOutput
explode(arr)One row per element
explode_outer(arr)Same, but keeps a null row for empty/null arrays
posexplode(arr)Element + position index
inline(arr<struct>)Explodes an array of structs into columns

These functions increase the row count, which is exactly what you want when normalizing nested data into a flat table. To go the other way — collapsing rows back into an array — use collect_list() or collect_set() in a GROUP BY.

Higher-Order Functions

Before higher-order functions, transforming array elements meant explode → transform rows → collect_list — verbose and shuffle-heavy. Higher-order functions apply a lambda to each element inline, preserving the row count:

FunctionPurposeExample
transform(arr, x -> expr)Map each elementtransform(nums, x -> x * 2)
filter(arr, x -> cond)Keep matching elementsfilter(nums, x -> x > 0)
exists(arr, x -> cond)True if any matchexists(nums, x -> x < 0)
aggregate(arr, start, (acc,x)->...)Fold to one valueaggregate(nums, 0, (a,x)->a+x)
SELECT transform(scores, s -> s + 5) AS curved FROM exams;

The lambda variable (x, s) names each element. Because no rows are exploded and re-collected, higher-order functions avoid an extra shuffle and are far more efficient for per-element work on arrays. aggregate (also called reduce) optionally takes a finalizer lambda to post-process the accumulated result.

Parsing JSON Strings and Working with Maps

Raw data often arrives with a JSON payload stored as a string column rather than as native nested types. Spark provides functions to parse and navigate it:

FunctionPurpose
from_json(col, schema)Parse a JSON string into a STRUCT/ARRAY using a supplied schema
to_json(col)Serialize a complex column back to a JSON string
schema_of_json(json)Infer a schema string from a sample JSON value
get_json_object(col, '$.a.b')Extract a single value with a JSON path
json_tuple(col, 'a', 'b')Extract several top-level keys at once

Use from_json when you want to fully materialize nested types for downstream querying, and get_json_object for quick one-off extraction without defining a schema.

Map columns deserve special attention: map_keys(m) and map_values(m) return arrays, element_at(m, 'k') (or m['k']) looks up a value, and explode(m) produces two columns — key and value. A common pattern is explode on a map to normalize key-value attributes into rows. For arrays, additional helpers include array_distinct, array_union, array_intersect, array_sort, sort_array, flatten (collapse an array of arrays), and slice.

Combining these utilities with higher-order functions lets you reshape deeply nested semi-structured data entirely in Spark SQL, without dropping to Python UDFs — which is both faster and the approach the exam expects you to recognize as best practice for nested-data manipulation.

Building and Querying Nested Structures

You also need to construct nested types, not just read them. struct(col1, col2) bundles columns into an anonymous struct, named_struct('a', x, 'b', y) gives the fields names, and selecting struct(*) packs a whole row into one column — handy before writing a nested Parquet/Delta layout. array(1,2,3) and map('k1', v1, 'k2', v2) build literal collections.

A few patterns recur on the exam:

  • Star expansion of a struct: SELECT address.* FROM t promotes every struct field to a top-level column, the inverse of bundling.
  • Indexing safely: element_at(arr, 1) is 1-based and returns null out of range, whereas arr[0] is 0-based — mixing them up is a classic mistake.
  • Counting elements: size(arr) and cardinality(map) return the number of entries; size returns -1 for null by default.
  • Filtering arrays in a predicate: WHERE array_contains(tags, 'urgent') keeps rows whose array holds a value, without exploding.

The broad principle the certification reinforces: prefer operating on complex types in place — with dot/bracket access, higher-order functions, and the array/map utility library — over flattening with explode and re-aggregating, because in-place operations preserve row counts and avoid shuffles. Reach for explode only when the downstream consumer genuinely needs one row per element.

Test Your Knowledge

Which function flattens an array column so each element becomes its own row, also returning the element's index?

A
B
C
D
Test Your Knowledge

How do you access the city field of a STRUCT column named address in Spark SQL?

A
B
C
D
Test Your Knowledge

Why is transform(nums, x -> x * 2) preferred over exploding, multiplying, and collect_list to double every array element?

A
B
C
D