2.6 User-Defined Functions (UDFs)

Key Takeaways

A UDF extends Spark with custom row-level logic when built-in functions are insufficient; register with spark.udf.register or the @udf decorator.
SQL UDFs (CREATE FUNCTION) stay inside the engine and are optimized by Catalyst, making them the fastest UDF type.
Python UDFs incur serialization and per-row JVM-to-Python overhead and are opaque to Catalyst, so they are the slowest.
Pandas (vectorized) UDFs process data in Arrow batches, dramatically faster than row-at-a-time Python UDFs.
Always prefer built-in functions over UDFs; reach for UDFs only when no native function exists.

Last updated: June 2026

What a UDF Is and When to Use One

A user-defined function (UDF) lets you apply custom logic to each row of a DataFrame when Spark's rich library of built-in functions cannot express what you need. The cardinal rule, tested repeatedly: prefer built-in functions. Built-ins are implemented natively, understood by the Catalyst optimizer, and run inside the JVM with no serialization penalty. A UDF should be the last resort, used only when no native function exists for your transformation.

In PySpark you create one by wrapping a Python function, declaring its return type, and registering it:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def initials(name):
    return ''.join(w[0] for w in name.split())

df.select(initials('full_name'))

To call it from Spark SQL, register it by name: spark.udf.register('initials', initials), then SELECT initials(full_name) FROM people.

UDF Types and Performance

Not all UDFs cost the same. The exam expects you to rank them by performance:

UDF type	Optimized by Catalyst?	Speed
SQL UDF (`CREATE FUNCTION`)	Yes — stays in the engine	Fastest
Pandas / vectorized UDF	Partially (Arrow batches)	Fast
Python UDF (row-at-a-time)	No — opaque black box	Slowest

A SQL UDF is defined entirely in SQL and inlined, so Catalyst can optimize it like any expression:

CREATE FUNCTION fahrenheit(c DOUBLE)
RETURNS DOUBLE RETURN c * 9/5 + 32;

A plain Python UDF serializes each row from the JVM to a Python worker, runs your function, and serializes the result back — per row. This crossing of the JVM-Python boundary plus Catalyst's inability to see inside the function makes it the slowest option. A Pandas UDF (decorated with @pandas_udf) instead moves data in Apache Arrow batches and operates on whole pandas.Series at once, eliminating most per-row overhead and running far faster than a scalar Python UDF.

SQL UDFs, Registration, and Governance

In Unity Catalog, a SQL UDF is a first-class, governed object: it is created in a schema (catalog.schema.function), can be granted to users, and is shareable across notebooks and warehouses. Its body is a single SQL expression, so the optimizer inlines it directly into the query plan:

CREATE OR REPLACE FUNCTION main.util.mask_email(e STRING)
RETURNS STRING
RETURN concat(left(e,1), '***@', split(e,'@')[1]);

SELECT main.util.mask_email(email) FROM users;

Because it is just SQL, it carries no serialization cost and benefits from predicate pushdown and constant folding. This is why the exam steers you toward SQL UDFs (or built-ins) over Python UDFs whenever the logic can be expressed declaratively.

A Python UDF, by contrast, must declare its return type explicitly — if you omit or mis-specify it, Spark returns nulls or errors, a classic gotcha. To call a Python UDF from SQL you must register it by name with spark.udf.register('name', fn, returnType). Note also that UDFs run on the executors, not the driver, and that exceptions inside a UDF fail the whole task — so defensive coding (handling nulls, bad input) matters.

Choosing the Right Tool

The decision hierarchy the exam rewards is:

Built-in function — always first; native, vectorized, Catalyst-aware.
SQL UDF — when logic is declarative but not covered by a built-in.
Pandas (vectorized) UDF — when you need Python/library logic at scale; Arrow batching keeps it fast.
Scalar Python UDF — last resort for genuinely row-by-row custom logic.

Concern	Built-in / SQL UDF	Python UDF
Catalyst optimization	Yes	No (opaque)
Serialization cost	None	Per-row JVM↔Python
Predicate pushdown	Preserved	Often blocked
Best for	Standard transforms	Niche custom logic

A real-world trap: wrapping simple string or math operations in a Python UDF when concat, regexp_replace, split, or round already exist. This needlessly disables optimization and slows the job many-fold. Always search the pyspark.sql.functions library (and Spark SQL functions list) before writing any UDF — the right answer to 'how do I transform this column?' is almost always a built-in, occasionally a SQL UDF, and only rarely a Python UDF.

Pandas UDF Variants and Determinism

The Pandas UDF family (also called vectorized UDFs) comes in a few shapes worth recognizing. A Series-to-Series Pandas UDF takes one or more pandas.Series and returns a Series of the same length — ideal for elementwise math or string work with NumPy/pandas. An iterator variant streams Arrow batches for expensive one-time setup (like loading an ML model once per partition). A grouped map applies a function to each group of a groupBy and can change the row count, useful for per-group transformations like normalization.

Two governance points round out the topic. First, UDFs should be deterministic — the same input must yield the same output — because Spark may evaluate, re-evaluate, or reorder them; a UDF calling random() or the current time can produce inconsistent results and break optimizations. Mark genuinely nondeterministic UDFs with asNondeterministic(). Second, in Unity Catalog, Python UDFs run in a sandboxed environment and SQL UDFs are governed catalog objects you can grant EXECUTE on, so permissions and lineage apply just as they do to tables — another reason SQL UDFs are favored for shared, governed logic.

Test Your Knowledge

Which type of UDF is fastest because it stays inside the engine and is optimized by Catalyst?

A SQL UDF created with CREATE FUNCTION

A row-at-a-time Python UDF

A Pandas UDF

A Scala UDF called from Python

Test Your Knowledge

Why is a plain Python UDF the slowest option in Spark?

It can only process one partition per cluster

It cannot return a value type

It serializes each row across the JVM-to-Python boundary and is opaque to Catalyst

It forces the entire DataFrame into the driver

Test Your Knowledge

Before writing any UDF, what should you do first?

Increase the cluster size to absorb the overhead

Disable the Catalyst optimizer

Convert the DataFrame to an RDD

Check whether a built-in pyspark.sql function already does it

Up Next

2.7 Structured Streaming Fundamentals

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

2.6 User-Defined Functions (UDFs)

Key Takeaways

What a UDF Is and When to Use One

UDF Types and Performance

SQL UDFs, Registration, and Governance

Choosing the Right Tool

Pandas UDF Variants and Determinism

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

2.6 User-Defined Functions (UDFs)

Key Takeaways

What a UDF Is and When to Use One

UDF Types and Performance

SQL UDFs, Registration, and Governance

Choosing the Right Tool

Pandas UDF Variants and Determinism