2.6 User-Defined Functions (UDFs)
Key Takeaways
- A UDF extends Spark with custom row-level logic when built-in functions are insufficient; register with spark.udf.register or the @udf decorator.
- SQL UDFs (CREATE FUNCTION) stay inside the engine and are optimized by Catalyst, making them the fastest UDF type.
- Python UDFs incur serialization and per-row JVM-to-Python overhead and are opaque to Catalyst, so they are the slowest.
- Pandas (vectorized) UDFs process data in Arrow batches, dramatically faster than row-at-a-time Python UDFs.
- Always prefer built-in functions over UDFs; reach for UDFs only when no native function exists.
What a UDF Is and When to Use One
A user-defined function (UDF) lets you apply custom logic to each row of a DataFrame when Spark's rich library of built-in functions cannot express what you need. The cardinal rule, tested repeatedly: prefer built-in functions. Built-ins are implemented natively, understood by the Catalyst optimizer, and run inside the JVM with no serialization penalty. A UDF should be the last resort, used only when no native function exists for your transformation.
In PySpark you create one by wrapping a Python function, declaring its return type, and registering it:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def initials(name):
return ''.join(w[0] for w in name.split())
df.select(initials('full_name'))
To call it from Spark SQL, register it by name: spark.udf.register('initials', initials), then SELECT initials(full_name) FROM people.
UDF Types and Performance
Not all UDFs cost the same. The exam expects you to rank them by performance:
| UDF type | Optimized by Catalyst? | Speed |
|---|---|---|
SQL UDF (CREATE FUNCTION) | Yes — stays in the engine | Fastest |
| Pandas / vectorized UDF | Partially (Arrow batches) | Fast |
| Python UDF (row-at-a-time) | No — opaque black box | Slowest |
A SQL UDF is defined entirely in SQL and inlined, so Catalyst can optimize it like any expression:
CREATE FUNCTION fahrenheit(c DOUBLE)
RETURNS DOUBLE RETURN c * 9/5 + 32;
A plain Python UDF serializes each row from the JVM to a Python worker, runs your function, and serializes the result back — per row. This crossing of the JVM-Python boundary plus Catalyst's inability to see inside the function makes it the slowest option. A Pandas UDF (decorated with @pandas_udf) instead moves data in Apache Arrow batches and operates on whole pandas.Series at once, eliminating most per-row overhead and running far faster than a scalar Python UDF.
SQL UDFs, Registration, and Governance
In Unity Catalog, a SQL UDF is a first-class, governed object: it is created in a schema (catalog.schema.function), can be granted to users, and is shareable across notebooks and warehouses. Its body is a single SQL expression, so the optimizer inlines it directly into the query plan:
CREATE OR REPLACE FUNCTION main.util.mask_email(e STRING)
RETURNS STRING
RETURN concat(left(e,1), '***@', split(e,'@')[1]);
SELECT main.util.mask_email(email) FROM users;
Because it is just SQL, it carries no serialization cost and benefits from predicate pushdown and constant folding. This is why the exam steers you toward SQL UDFs (or built-ins) over Python UDFs whenever the logic can be expressed declaratively.
A Python UDF, by contrast, must declare its return type explicitly — if you omit or mis-specify it, Spark returns nulls or errors, a classic gotcha. To call a Python UDF from SQL you must register it by name with spark.udf.register('name', fn, returnType). Note also that UDFs run on the executors, not the driver, and that exceptions inside a UDF fail the whole task — so defensive coding (handling nulls, bad input) matters.
Choosing the Right Tool
The decision hierarchy the exam rewards is:
- Built-in function — always first; native, vectorized, Catalyst-aware.
- SQL UDF — when logic is declarative but not covered by a built-in.
- Pandas (vectorized) UDF — when you need Python/library logic at scale; Arrow batching keeps it fast.
- Scalar Python UDF — last resort for genuinely row-by-row custom logic.
| Concern | Built-in / SQL UDF | Python UDF |
|---|---|---|
| Catalyst optimization | Yes | No (opaque) |
| Serialization cost | None | Per-row JVM↔Python |
| Predicate pushdown | Preserved | Often blocked |
| Best for | Standard transforms | Niche custom logic |
A real-world trap: wrapping simple string or math operations in a Python UDF when concat, regexp_replace, split, or round already exist. This needlessly disables optimization and slows the job many-fold. Always search the pyspark.sql.functions library (and Spark SQL functions list) before writing any UDF — the right answer to 'how do I transform this column?' is almost always a built-in, occasionally a SQL UDF, and only rarely a Python UDF.
Pandas UDF Variants and Determinism
The Pandas UDF family (also called vectorized UDFs) comes in a few shapes worth recognizing. A Series-to-Series Pandas UDF takes one or more pandas.Series and returns a Series of the same length — ideal for elementwise math or string work with NumPy/pandas. An iterator variant streams Arrow batches for expensive one-time setup (like loading an ML model once per partition). A grouped map applies a function to each group of a groupBy and can change the row count, useful for per-group transformations like normalization.
Two governance points round out the topic. First, UDFs should be deterministic — the same input must yield the same output — because Spark may evaluate, re-evaluate, or reorder them; a UDF calling random() or the current time can produce inconsistent results and break optimizations. Mark genuinely nondeterministic UDFs with asNondeterministic(). Second, in Unity Catalog, Python UDFs run in a sandboxed environment and SQL UDFs are governed catalog objects you can grant EXECUTE on, so permissions and lineage apply just as they do to tables — another reason SQL UDFs are favored for shared, governed logic.
Which type of UDF is fastest because it stays inside the engine and is optimized by Catalyst?
Why is a plain Python UDF the slowest option in Spark?
Before writing any UDF, what should you do first?