2.6 User-Defined Functions (UDFs)
Key Takeaways
- Python UDFs register custom Python logic as SQL-callable functions, but they serialize data between JVM and Python, causing performance overhead.
- SQL UDFs are defined entirely in SQL and run natively in the query engine without serialization overhead.
- Pandas UDFs (vectorized UDFs) use Apache Arrow for efficient batch processing, significantly outperforming standard Python UDFs.
- UDFs registered with spark.udf.register() are callable from SQL queries; those created with the @udf decorator are usable only through the DataFrame API unless also registered.
- Unity Catalog UDFs can be shared across workspaces and governed with standard permissions.
User-Defined Functions (UDFs)
Quick Answer: Use SQL UDFs for simple logic (best performance), Pandas UDFs for complex Python operations (good performance via Arrow), and standard Python UDFs as a last resort (poorest performance due to serialization). Register UDFs with spark.udf.register() to make them callable from SQL.
SQL UDFs
SQL UDFs are defined entirely in SQL and execute natively in the query engine:
```sql
-- Create a SQL UDF
CREATE OR REPLACE FUNCTION my_catalog.my_schema.celsius_to_fahrenheit(temp DOUBLE)
RETURNS DOUBLE
RETURN (temp * 9/5) + 32;

-- Use the UDF
SELECT city, celsius_to_fahrenheit(temperature) AS temp_f
FROM weather_data;
```
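The conversion formula is easy to sanity-check outside the query engine. A plain-Python mirror of the SQL function above (illustrative only; the real UDF runs natively in SQL):

```python
# Plain-Python mirror of the SQL UDF above, for sanity-checking the
# formula locally. The actual SQL UDF executes inside the query engine.
def celsius_to_fahrenheit(temp: float) -> float:
    return (temp * 9 / 5) + 32

print(celsius_to_fahrenheit(100.0))  # 212.0
print(celsius_to_fahrenheit(0.0))    # 32.0
```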
Benefits of SQL UDFs
- No serialization overhead — runs natively in the SQL engine
- Optimized by Catalyst/Photon — can be inlined and optimized
- Governed by Unity Catalog — standard GRANT/REVOKE permissions
- Shareable across workspaces — accessible wherever the catalog is available
Python UDFs
Standard Python UDFs
```python
# Method 1: Register for SQL use
from pyspark.sql.types import StringType

def classify_amount(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

# Register the UDF for use in SQL
spark.udf.register("classify_amount", classify_amount, StringType())
```

```sql
-- Now callable from SQL
SELECT order_id, amount, classify_amount(amount) AS category
FROM orders;
```
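Because the function being registered is ordinary Python, it can be unit-tested locally before registration, with no Spark session required. A minimal check:

```python
# Local sanity check of the plain Python function before registering it
# with spark.udf.register(); no Spark session is needed for this step.
def classify_amount(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

assert classify_amount(5000) == "high"
assert classify_amount(500) == "medium"
assert classify_amount(50) == "low"
```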
```python
# Method 2: Using the @udf decorator for DataFrame API use
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def classify_amount_udf(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

# Use with the DataFrame API (withColumn returns a new DataFrame)
df = df.withColumn("category", classify_amount_udf("amount"))
```
Performance Warning
Standard Python UDFs have significant overhead:
- Data is serialized from JVM (Spark) to Python for each row
- Python processes each row one at a time
- Results are serialized back from Python to JVM
- This row-by-row serialization is slow for large datasets
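The cost of per-row transfer can be seen even without Spark. A rough stdlib sketch (purely illustrative, not Spark's actual wire format): serializing 100,000 rows one at a time repeats per-payload framing that a single batched payload pays only once.

```python
# Rough illustration of per-row vs. batched serialization overhead.
# This is NOT Spark's actual protocol -- just a stdlib analogy showing
# that one payload per row carries repeated framing costs that a single
# batched payload avoids.
import pickle

rows = [(i, float(i)) for i in range(100_000)]

per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)  # one payload per row
batched_bytes = len(pickle.dumps(rows))                  # one payload for all rows

print(per_row_bytes > batched_bytes)  # True: per-row framing adds up
```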
Pandas UDFs (Vectorized UDFs)
Pandas UDFs use Apache Arrow for efficient batch processing:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def normalize(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

# Use with the DataFrame API (withColumn returns a new DataFrame)
df = df.withColumn("normalized_amount", normalize("amount"))
```
Why Pandas UDFs Are Faster
- Data transferred in batches (columns of data) via Apache Arrow
- Arrow uses an in-memory columnar format — no row-by-row serialization
- Pandas operations are vectorized — NumPy/Pandas process entire columns at once
- Typically 10-100x faster than standard Python UDFs
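A practical side effect of this design: the body of a Pandas UDF is ordinary pandas code operating on whole Series, so it can be tested locally on a plain pd.Series without a Spark session. Note that pandas' Series.std() defaults to the sample standard deviation (ddof=1):

```python
# The body of the pandas UDF above, tested locally on a plain pd.Series.
# Series.std() uses the sample standard deviation (ddof=1) by default.
import pandas as pd

def normalize(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

print(normalize(pd.Series([1.0, 2.0, 3.0])).tolist())  # [-1.0, 0.0, 1.0]
```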
UDF Performance Comparison
| UDF Type | Performance | Best For |
|---|---|---|
| SQL UDF | Best (native execution) | Simple logic, math, string operations |
| Pandas UDF | Good (Arrow + vectorized) | Complex Python logic, statistical operations |
| Python UDF | Slowest (row-by-row serialization) | When Pandas/SQL cannot express the logic |
On the Exam: Know that SQL UDFs have the best performance because they run natively in the query engine. Pandas UDFs are the recommended approach when Python is needed. Standard Python UDFs should be avoided for large datasets due to serialization overhead.
Review Questions
- Why are standard Python UDFs slower than Pandas UDFs in Spark?
- How do you make a Python function callable from a Spark SQL query?