2.6 User-Defined Functions (UDFs)

Key Takeaways

  • Python UDFs register custom Python logic as SQL-callable functions, but they serialize data between JVM and Python, causing performance overhead.
  • SQL UDFs are defined entirely in SQL and run natively in the query engine without serialization overhead.
  • Pandas UDFs (vectorized UDFs) use Apache Arrow for efficient batch processing, significantly outperforming standard Python UDFs.
  • UDFs registered with spark.udf.register() are callable from SQL queries; those created with the @udf decorator alone are usable only through the PySpark DataFrame API.
  • Unity Catalog UDFs can be shared across workspaces and governed with standard permissions.
Last updated: March 2026

User-Defined Functions (UDFs)

Quick Answer: Use SQL UDFs for simple logic (best performance), Pandas UDFs for complex Python operations (good performance via Arrow), and standard Python UDFs as a last resort (poorest performance due to serialization). Register UDFs with spark.udf.register() to make them callable from SQL.

SQL UDFs

SQL UDFs are defined entirely in SQL and execute natively in the query engine:

-- Create a SQL UDF
CREATE OR REPLACE FUNCTION my_catalog.my_schema.celsius_to_fahrenheit(temp DOUBLE)
RETURNS DOUBLE
RETURN (temp * 9/5) + 32;

-- Use the UDF
SELECT city, celsius_to_fahrenheit(temperature) AS temp_f
FROM weather_data;
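The conversion formula itself can be sanity-checked in plain Python before creating the SQL function. This is a quick local check of the arithmetic, not part of the UDF definition:

```python
# Plain-Python mirror of the SQL UDF body, used only to verify the formula
def celsius_to_fahrenheit(temp: float) -> float:
    return (temp * 9 / 5) + 32

print(celsius_to_fahrenheit(100.0))  # → 212.0
print(celsius_to_fahrenheit(0.0))    # → 32.0
```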

Benefits of SQL UDFs

  • No serialization overhead — runs natively in the SQL engine
  • Optimized by Catalyst/Photon — can be inlined and optimized
  • Governed by Unity Catalog — standard GRANT/REVOKE permissions
  • Shareable across workspaces — accessible wherever the catalog is available
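Because the function is a Unity Catalog object, access can be granted like any other securable. A minimal sketch, assuming the function above exists and using `data_analysts` as a placeholder group name:

```sql
-- Hypothetical example: grant a group permission to call the function
GRANT EXECUTE ON FUNCTION my_catalog.my_schema.celsius_to_fahrenheit TO `data_analysts`;
```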

Python UDFs

Standard Python UDFs

# Method 1: Register for SQL use
from pyspark.sql.types import StringType

def classify_amount(amount):
    # Guard against NULL input, which arrives in Python as None
    if amount is None:
        return None
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

# Register UDF for use in SQL
spark.udf.register("classify_amount", classify_amount, StringType())

-- Now callable from SQL
SELECT order_id, amount, classify_amount(amount) AS category
FROM orders;

# Method 2: Using decorator for PySpark DataFrame use
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def classify_amount_udf(amount):
    # Guard against NULL input, which arrives in Python as None
    if amount is None:
        return None
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

# Use with DataFrame API
df.withColumn("category", classify_amount_udf("amount"))
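Because the wrapped function is ordinary Python, its branching logic can be unit-tested without a Spark session. A minimal sketch using the same thresholds as above:

```python
# Plain-Python version of the classification logic, for testing the thresholds
def classify_amount(amount):
    if amount is None:
        return None
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

print(classify_amount(1500))  # → high
print(classify_amount(500))   # → medium
print(classify_amount(50))    # → low
```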

Performance Warning

Standard Python UDFs have significant overhead:

  1. Data is serialized from JVM (Spark) to Python for each row
  2. Python processes each row one at a time
  3. Results are serialized back from Python to JVM
  4. This row-by-row serialization is slow for large datasets
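The cost difference between the two transfer models can be sketched in plain Python. This is a rough illustration using pickle, not Spark's actual wire format: the point is that the row-at-a-time model pays one serialize/deserialize round trip per row, while the batch model pays one per batch:

```python
import pickle

rows = [{"order_id": i, "amount": float(i)} for i in range(1000)]

# Row-at-a-time: one round trip per row (what standard Python UDFs
# effectively do between the JVM and the Python worker)
per_row_calls = 0
for row in rows:
    payload = pickle.dumps(row)
    pickle.loads(payload)
    per_row_calls += 1

# Batch-at-a-time: one round trip for the whole batch (the Arrow-style model)
batch_payload = pickle.dumps(rows)
pickle.loads(batch_payload)
batch_calls = 1

print(per_row_calls, batch_calls)  # → 1000 1
```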

Pandas UDFs (Vectorized UDFs)

Pandas UDFs use Apache Arrow for efficient batch processing:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def normalize(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

# Use with DataFrame
df.withColumn("normalized_amount", normalize("amount"))
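The function body is plain Pandas, so it can be verified locally on a pd.Series before being wired up as a UDF (assuming pandas is installed; note that pd.Series.std() uses the sample standard deviation):

```python
import pandas as pd

def normalize(series: pd.Series) -> pd.Series:
    # Same body as the Pandas UDF above, minus the decorator
    return (series - series.mean()) / series.std()

s = pd.Series([10.0, 20.0, 30.0])
result = normalize(s)
print(list(result))  # → [-1.0, 0.0, 1.0]
```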

Why Pandas UDFs Are Faster

  • Data transferred in batches (columns of data) via Apache Arrow
  • Arrow uses an in-memory columnar format — no row-by-row serialization
  • Pandas operations are vectorized — NumPy/Pandas process entire columns at once
  • Typically 10-100x faster than standard Python UDFs

UDF Performance Comparison

| UDF Type   | Performance                        | Best For                                    |
| ---------- | ---------------------------------- | ------------------------------------------- |
| SQL UDF    | Best (native execution)            | Simple logic, math, string operations       |
| Pandas UDF | Good (Arrow + vectorized)          | Complex Python logic, statistical operations |
| Python UDF | Slowest (row-by-row serialization) | When Pandas/SQL cannot express the logic    |

On the Exam: Know that SQL UDFs have the best performance because they run natively in the query engine. Pandas UDFs are the recommended approach when Python is needed. Standard Python UDFs should be avoided for large datasets due to serialization overhead.

Test Your Knowledge

Why are standard Python UDFs slower than Pandas UDFs in Spark?

Test Your Knowledge

How do you make a Python function callable from a Spark SQL query?
