A data engineer has an array column "prices" and needs to keep only values greater than 100 without exploding the array. Which function should they use?

FILTER(prices, x -> x > 100). FILTER keeps only array elements that satisfy the condition, returning a new array. TRANSFORM changes each element (returns same-length array). EXISTS returns a boolean (true/false). REDUCE combines all elements into a single value.

What is the difference between collect_list() and collect_set() in Spark SQL?

collect_list() preserves duplicates; collect_set() removes duplicates. collect_list() aggregates values into an array that may contain duplicates. collect_set() aggregates values into an array with duplicates removed. Both are used to convert grouped rows back into array form.

Which notation is used to access nested fields in a JSON string column in Databricks SQL?

Colon notation: column:field. Colon notation (column:field) is used to access nested fields within JSON string or variant columns. Dot notation (column.field) is used for struct columns that have already been parsed into a typed schema.

Complex Data Types and Higher-Order Functions

Quick Answer: Spark handles nested data with arrays, maps, and structs. Higher-order functions (TRANSFORM, FILTER, EXISTS) process arrays inline without exploding. Use colon notation for JSON access and explode/collect for row-level transformations.

Complex Data Types

Arrays

An ordered collection of elements of the same type:

-- Create an array
SELECT array(1, 2, 3, 4, 5) AS numbers;

-- Access array elements (0-indexed)
SELECT items[0] AS first_item FROM orders;

-- Array functions
SELECT
    size(items) AS item_count,           -- Number of elements
    array_contains(items, 'laptop') AS has_laptop,  -- Check membership
    array_distinct(items) AS unique_items,           -- Remove duplicates
    flatten(nested_arrays) AS flat_array,            -- Flatten nested arrays
    array_union(arr1, arr2) AS combined              -- Union of two arrays
FROM orders;

Maps

A collection of key-value pairs:

-- Create a map
SELECT map('name', 'Alice', 'age', '30') AS user_info;

-- Access map values
SELECT user_info['name'] AS name FROM users;

-- Map functions
SELECT
    map_keys(properties) AS all_keys,
    map_values(properties) AS all_values,
    map_from_arrays(key_array, value_array) AS new_map
FROM events;

Structs

A named collection of fields (like a row within a row):

-- Create a struct
SELECT struct(1 AS id, 'Alice' AS name, 30 AS age) AS person;

-- Access struct fields
SELECT address.city, address.state, address.zip
FROM customers;

Accessing Nested Data

JSON Colon Notation (for variant/string JSON columns)

-- Access nested JSON fields using colon notation
SELECT
    raw_data:customer:name AS customer_name,
    raw_data:customer:email AS customer_email,
    raw_data:items[0]:product AS first_product
FROM raw_events;

Dot Notation (for struct columns)

-- Access struct fields using dot notation
SELECT
    customer.name AS customer_name,
    customer.address.city AS city
FROM parsed_events;

Higher-Order Functions

Higher-order functions process array elements without exploding them into separate rows:

TRANSFORM

Apply a function to every element in an array:

-- Double every element in an array
SELECT TRANSFORM(numbers, x -> x * 2) AS doubled
FROM data;

-- Convert array of strings to uppercase
SELECT TRANSFORM(names, x -> upper(x)) AS upper_names
FROM data;

-- Calculate tax for each item price
SELECT TRANSFORM(prices, p -> p * 1.08) AS prices_with_tax
FROM orders;

FILTER

Keep only elements that match a condition:

-- Keep only positive numbers
SELECT FILTER(numbers, x -> x > 0) AS positive_numbers
FROM data;

-- Keep items over \$100
SELECT FILTER(items, i -> i.price > 100) AS expensive_items
FROM orders;

EXISTS

Check if any element satisfies a condition (returns boolean):

-- Check if any item costs more than \$500
SELECT EXISTS(items, i -> i.price > 500) AS has_expensive_item
FROM orders;

REDUCE / AGGREGATE

Combine all elements into a single value:

-- Sum all elements in an array
SELECT REDUCE(numbers, 0, (acc, x) -> acc + x) AS total
FROM data;

-- Concatenate array elements
SELECT AGGREGATE(names, '', (acc, x) -> concat(acc, ', ', x)) AS all_names
FROM data;

Explode and Collect

explode() — Arrays to Rows

-- Each array element becomes a separate row
SELECT order_id, explode(items) AS item
FROM orders;

-- With position
SELECT order_id, posexplode(items) AS (position, item)
FROM orders;

collect_list() / collect_set() — Rows to Arrays

-- Aggregate rows back into an array
SELECT
    customer_id,
    collect_list(product_name) AS all_products,     -- Allows duplicates
    collect_set(product_category) AS unique_categories  -- No duplicates
FROM orders
GROUP BY customer_id;

Semi-Structured Data Functions

-- Parse JSON string into struct
SELECT from_json(json_string, 'name STRING, age INT') AS parsed
FROM raw_data;

-- Convert struct to JSON string
SELECT to_json(struct_column) AS json_string
FROM parsed_data;

-- Parse JSON with schema_of_json
SELECT from_json(json_col, schema_of_json('{"name":"str","age":1}')) AS parsed
FROM raw_data;

-- Flatten complex nested structures
SELECT inline(array_of_structs) FROM data;

On the Exam: Higher-order functions are preferred over explode + group by + collect for performance. Know the difference between TRANSFORM (changes elements), FILTER (selects elements), and EXISTS (boolean check).

Databricks Certified Data Engineer Associate

2.5 Complex Data Types and Higher-Order Functions

Key Takeaways

Complex Data Types and Higher-Order Functions

Complex Data Types

Arrays

Maps

Structs

Accessing Nested Data

JSON Colon Notation (for variant/string JSON columns)

Dot Notation (for struct columns)

Higher-Order Functions

TRANSFORM

FILTER

EXISTS

REDUCE / AGGREGATE

Explode and Collect

explode() — Arrays to Rows

collect_list() / collect_set() — Rows to Arrays

Semi-Structured Data Functions

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

2.5 Complex Data Types and Higher-Order Functions

Key Takeaways

Complex Data Types and Higher-Order Functions

Complex Data Types

Arrays

Maps

Structs

Accessing Nested Data

JSON Colon Notation (for variant/string JSON columns)

Dot Notation (for struct columns)

Higher-Order Functions

TRANSFORM

FILTER

EXISTS

REDUCE / AGGREGATE

Explode and Collect

explode() — Arrays to Rows

collect_list() / collect_set() — Rows to Arrays

Semi-Structured Data Functions