2.5 Complex Data Types and Higher-Order Functions

Key Takeaways

Spark complex types are ARRAY (ordered list), STRUCT (named nested fields accessed with dot notation), and MAP (key-value pairs).
explode() turns one array element per row; posexplode() adds the position index; both increase row count.
Higher-order functions transform, filter, exists, and aggregate apply a lambda to array elements inline without exploding and re-collecting.
Dot notation (col.field) reads struct fields, brackets (col[key] or col['k']) read map values or array indices.
Higher-order functions avoid the expensive explode-transform-collect_list shuffle pattern.

Last updated: June 2026

The Three Complex Types

Spark SQL supports three nested ('complex') data types that let a single column hold structured data:

Type	Shape	Access
ARRAY	Ordered list of same-typed elements	`arr[0]`, `element_at(arr, 1)`
STRUCT	Named fields, like a sub-record	`s.field` (dot notation)
MAP	Key-value pairs	`m['key']`, `m.key`

Nested JSON ingests naturally into these types — an object becomes a STRUCT, a list becomes an ARRAY. Struct fields are read with dot notation (address.city), array elements by index (items[0]), and map values by key (attributes['color']). You can construct them with array(), struct() / named_struct(), and map(), and inspect them with size() (length of array/map), array_contains(), and map_keys() / map_values().

Flattening with explode and posexplode

To turn array elements into individual rows, use explode(): a row with a 3-element array becomes three rows, each carrying one element. posexplode() does the same but also emits the zero-based position:

SELECT id, explode(tags) AS tag FROM events;
SELECT id, posexplode(tags) AS (pos, tag) FROM events;

Function	Output
`explode(arr)`	One row per element
`explode_outer(arr)`	Same, but keeps a null row for empty/null arrays
`posexplode(arr)`	Element + position index
`inline(arr<struct>)`	Explodes an array of structs into columns

These functions increase the row count, which is exactly what you want when normalizing nested data into a flat table. To go the other way — collapsing rows back into an array — use collect_list() or collect_set() in a GROUP BY.

Higher-Order Functions

Before higher-order functions, transforming array elements meant explode → transform rows → collect_list — verbose and shuffle-heavy. Higher-order functions apply a lambda to each element inline, preserving the row count:

Function	Purpose	Example
`transform(arr, x -> expr)`	Map each element	`transform(nums, x -> x * 2)`
`filter(arr, x -> cond)`	Keep matching elements	`filter(nums, x -> x > 0)`
`exists(arr, x -> cond)`	True if any match	`exists(nums, x -> x < 0)`
`aggregate(arr, start, (acc,x)->...)`	Fold to one value	`aggregate(nums, 0, (a,x)->a+x)`

SELECT transform(scores, s -> s + 5) AS curved FROM exams;

The lambda variable (x, s) names each element. Because no rows are exploded and re-collected, higher-order functions avoid an extra shuffle and are far more efficient for per-element work on arrays. aggregate (also called reduce) optionally takes a finalizer lambda to post-process the accumulated result.

Parsing JSON Strings and Working with Maps

Raw data often arrives with a JSON payload stored as a string column rather than as native nested types. Spark provides functions to parse and navigate it:

Function	Purpose
`from_json(col, schema)`	Parse a JSON string into a STRUCT/ARRAY using a supplied schema
`to_json(col)`	Serialize a complex column back to a JSON string
`schema_of_json(json)`	Infer a schema string from a sample JSON value
`get_json_object(col, '$.a.b')`	Extract a single value with a JSON path
`json_tuple(col, 'a', 'b')`	Extract several top-level keys at once

Use from_json when you want to fully materialize nested types for downstream querying, and get_json_object for quick one-off extraction without defining a schema.

Map columns deserve special attention: map_keys(m) and map_values(m) return arrays, element_at(m, 'k') (or m['k']) looks up a value, and explode(m) produces two columns — key and value. A common pattern is explode on a map to normalize key-value attributes into rows. For arrays, additional helpers include array_distinct, array_union, array_intersect, array_sort, sort_array, flatten (collapse an array of arrays), and slice.

Combining these utilities with higher-order functions lets you reshape deeply nested semi-structured data entirely in Spark SQL, without dropping to Python UDFs — which is both faster and the approach the exam expects you to recognize as best practice for nested-data manipulation.

Building and Querying Nested Structures

You also need to construct nested types, not just read them. struct(col1, col2) bundles columns into an anonymous struct, named_struct('a', x, 'b', y) gives the fields names, and selecting struct(*) packs a whole row into one column — handy before writing a nested Parquet/Delta layout. array(1,2,3) and map('k1', v1, 'k2', v2) build literal collections.

A few patterns recur on the exam:

Star expansion of a struct: SELECT address.* FROM t promotes every struct field to a top-level column, the inverse of bundling.
Indexing safely: element_at(arr, 1) is 1-based and returns null out of range, whereas arr[0] is 0-based — mixing them up is a classic mistake.
Counting elements: size(arr) and cardinality(map) return the number of entries; size returns -1 for null by default.
Filtering arrays in a predicate: WHERE array_contains(tags, 'urgent') keeps rows whose array holds a value, without exploding.

The broad principle the certification reinforces: prefer operating on complex types in place — with dot/bracket access, higher-order functions, and the array/map utility library — over flattening with explode and re-aggregating, because in-place operations preserve row counts and avoid shuffles. Reach for explode only when the downstream consumer genuinely needs one row per element.

Test Your Knowledge

Which function flattens an array column so each element becomes its own row, also returning the element's index?

explode()

collect_list()

posexplode()

transform()

Test Your Knowledge

How do you access the city field of a STRUCT column named address in Spark SQL?

address.city

address['city']

address[0]

explode(address)

Test Your Knowledge

Why is transform(nums, x -> x * 2) preferred over exploding, multiplying, and collect_list to double every array element?

transform converts the array to a STRUCT automatically

transform is the only way to multiply numbers in Spark

transform guarantees the result is sorted

transform applies the lambda inline without exploding rows, avoiding an extra shuffle

Up Next

2.6 User-Defined Functions (UDFs)

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

2.5 Complex Data Types and Higher-Order Functions

Key Takeaways

The Three Complex Types

Flattening with explode and posexplode

Higher-Order Functions

Parsing JSON Strings and Working with Maps

Building and Querying Nested Structures

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

2.5 Complex Data Types and Higher-Order Functions

Key Takeaways

The Three Complex Types

Flattening with explode and posexplode

Higher-Order Functions

Parsing JSON Strings and Working with Maps

Building and Querying Nested Structures