2.6 User-Defined Functions (UDFs)

Key Takeaways

  • A UDF extends Spark with custom row-level logic when built-in functions are insufficient; register with spark.udf.register or the @udf decorator.
  • SQL UDFs (CREATE FUNCTION) stay inside the engine and are optimized by Catalyst, making them the fastest UDF type.
  • Python UDFs incur serialization and per-row JVM-to-Python overhead and are opaque to Catalyst, so they are the slowest.
  • Pandas (vectorized) UDFs process data in Arrow batches, dramatically faster than row-at-a-time Python UDFs.
  • Always prefer built-in functions over UDFs; reach for UDFs only when no native function exists.
Last updated: June 2026

What a UDF Is and When to Use One

A user-defined function (UDF) lets you apply custom logic to each row of a DataFrame when Spark's rich library of built-in functions cannot express what you need. The cardinal rule, tested repeatedly: prefer built-in functions. Built-ins are implemented natively, understood by the Catalyst optimizer, and run inside the JVM with no serialization penalty. A UDF should be the last resort, used only when no native function exists for your transformation.

In PySpark you create one by wrapping a Python function, declaring its return type, and registering it:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def initials(name):
    return ''.join(w[0] for w in name.split())

df.select(initials('full_name'))

To call it from Spark SQL, register it by name: spark.udf.register('initials', initials), then SELECT initials(full_name) FROM people.

UDF Types and Performance

Not all UDFs cost the same. The exam expects you to rank them by performance:

UDF typeOptimized by Catalyst?Speed
SQL UDF (CREATE FUNCTION)Yes — stays in the engineFastest
Pandas / vectorized UDFPartially (Arrow batches)Fast
Python UDF (row-at-a-time)No — opaque black boxSlowest

A SQL UDF is defined entirely in SQL and inlined, so Catalyst can optimize it like any expression:

CREATE FUNCTION fahrenheit(c DOUBLE)
RETURNS DOUBLE RETURN c * 9/5 + 32;

A plain Python UDF serializes each row from the JVM to a Python worker, runs your function, and serializes the result back — per row. This crossing of the JVM-Python boundary plus Catalyst's inability to see inside the function makes it the slowest option. A Pandas UDF (decorated with @pandas_udf) instead moves data in Apache Arrow batches and operates on whole pandas.Series at once, eliminating most per-row overhead and running far faster than a scalar Python UDF.

SQL UDFs, Registration, and Governance

In Unity Catalog, a SQL UDF is a first-class, governed object: it is created in a schema (catalog.schema.function), can be granted to users, and is shareable across notebooks and warehouses. Its body is a single SQL expression, so the optimizer inlines it directly into the query plan:

CREATE OR REPLACE FUNCTION main.util.mask_email(e STRING)
RETURNS STRING
RETURN concat(left(e,1), '***@', split(e,'@')[1]);

SELECT main.util.mask_email(email) FROM users;

Because it is just SQL, it carries no serialization cost and benefits from predicate pushdown and constant folding. This is why the exam steers you toward SQL UDFs (or built-ins) over Python UDFs whenever the logic can be expressed declaratively.

A Python UDF, by contrast, must declare its return type explicitly — if you omit or mis-specify it, Spark returns nulls or errors, a classic gotcha. To call a Python UDF from SQL you must register it by name with spark.udf.register('name', fn, returnType). Note also that UDFs run on the executors, not the driver, and that exceptions inside a UDF fail the whole task — so defensive coding (handling nulls, bad input) matters.

Choosing the Right Tool

The decision hierarchy the exam rewards is:

  1. Built-in function — always first; native, vectorized, Catalyst-aware.
  2. SQL UDF — when logic is declarative but not covered by a built-in.
  3. Pandas (vectorized) UDF — when you need Python/library logic at scale; Arrow batching keeps it fast.
  4. Scalar Python UDF — last resort for genuinely row-by-row custom logic.
ConcernBuilt-in / SQL UDFPython UDF
Catalyst optimizationYesNo (opaque)
Serialization costNonePer-row JVM↔Python
Predicate pushdownPreservedOften blocked
Best forStandard transformsNiche custom logic

A real-world trap: wrapping simple string or math operations in a Python UDF when concat, regexp_replace, split, or round already exist. This needlessly disables optimization and slows the job many-fold. Always search the pyspark.sql.functions library (and Spark SQL functions list) before writing any UDF — the right answer to 'how do I transform this column?' is almost always a built-in, occasionally a SQL UDF, and only rarely a Python UDF.

Pandas UDF Variants and Determinism

The Pandas UDF family (also called vectorized UDFs) comes in a few shapes worth recognizing. A Series-to-Series Pandas UDF takes one or more pandas.Series and returns a Series of the same length — ideal for elementwise math or string work with NumPy/pandas. An iterator variant streams Arrow batches for expensive one-time setup (like loading an ML model once per partition). A grouped map applies a function to each group of a groupBy and can change the row count, useful for per-group transformations like normalization.

Two governance points round out the topic. First, UDFs should be deterministic — the same input must yield the same output — because Spark may evaluate, re-evaluate, or reorder them; a UDF calling random() or the current time can produce inconsistent results and break optimizations. Mark genuinely nondeterministic UDFs with asNondeterministic(). Second, in Unity Catalog, Python UDFs run in a sandboxed environment and SQL UDFs are governed catalog objects you can grant EXECUTE on, so permissions and lineage apply just as they do to tables — another reason SQL UDFs are favored for shared, governed logic.

Test Your Knowledge

Which type of UDF is fastest because it stays inside the engine and is optimized by Catalyst?

A
B
C
D
Test Your Knowledge

Why is a plain Python UDF the slowest option in Spark?

A
B
C
D
Test Your Knowledge

Before writing any UDF, what should you do first?

A
B
C
D