2.6 User-Defined Functions (UDFs)
Key Takeaways
- Python UDFs register custom Python logic as SQL-callable functions, but they serialize data between JVM and Python, causing performance overhead.
- SQL UDFs are defined entirely in SQL and run natively in the query engine without serialization overhead.
- Pandas UDFs (vectorized UDFs) use Apache Arrow for efficient batch processing, significantly outperforming standard Python UDFs.
- UDFs registered with spark.udf.register() are callable from SQL queries; those created with the @udf decorator are usable only through the DataFrame API unless also registered.
- Unity Catalog UDFs can be shared across workspaces and governed with standard permissions.
User-Defined Functions (UDFs)
Quick Answer: Use SQL UDFs for simple logic (best performance), Pandas UDFs for complex Python operations (good performance via Arrow), and standard Python UDFs as a last resort (poorest performance due to serialization). Register UDFs with spark.udf.register() to make them callable from SQL.
SQL UDFs
SQL UDFs are defined entirely in SQL and execute natively in the query engine:
```sql
-- Create a SQL UDF
CREATE OR REPLACE FUNCTION my_catalog.my_schema.celsius_to_fahrenheit(temp DOUBLE)
RETURNS DOUBLE
RETURN (temp * 9/5) + 32;

-- Use the UDF
SELECT city, celsius_to_fahrenheit(temperature) AS temp_f
FROM weather_data;
```
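The conversion formula is easy to sanity-check outside the query engine. A plain-Python mirror of the SQL function above (illustrative only; the real UDF runs natively in SQL):

```python
# Plain-Python mirror of the SQL UDF above, for sanity-checking the
# formula locally. The actual SQL UDF executes inside the query engine.
def celsius_to_fahrenheit(temp: float) -> float:
    return (temp * 9 / 5) + 32

print(celsius_to_fahrenheit(100.0))  # 212.0
print(celsius_to_fahrenheit(0.0))    # 32.0
```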
Benefits of SQL UDFs
- No serialization overhead — runs natively in the SQL engine
- Optimized by Catalyst/Photon — can be inlined and optimized
- Governed by Unity Catalog — standard GRANT/REVOKE permissions
- Shareable across workspaces — accessible wherever the catalog is available
Python UDFs
Standard Python UDFs
```python
# Method 1: Register for SQL use
from pyspark.sql.types import StringType

def classify_amount(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

# Register the UDF for use in SQL
spark.udf.register("classify_amount", classify_amount, StringType())
```

```sql
-- Now callable from SQL
SELECT order_id, amount, classify_amount(amount) AS category
FROM orders;
```
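Because the function being registered is ordinary Python, it can be unit-tested locally before registration, with no Spark session required. A minimal check:

```python
# Local sanity check of the plain Python function before registering it
# with spark.udf.register(); no Spark session is needed for this step.
def classify_amount(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

assert classify_amount(5000) == "high"
assert classify_amount(500) == "medium"
assert classify_amount(50) == "low"
```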
```python
# Method 2: Using the @udf decorator for DataFrame API use
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def classify_amount_udf(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

# Use with the DataFrame API (withColumn returns a new DataFrame)
df = df.withColumn("category", classify_amount_udf("amount"))
```
Performance Warning
Standard Python UDFs have significant overhead:
- Data is serialized from JVM (Spark) to Python for each row
- Python processes each row one at a time
- Results are serialized back from Python to JVM
- This row-by-row serialization is slow for large datasets
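The cost of per-row transfer can be seen even without Spark. A rough stdlib sketch (purely illustrative, not Spark's actual wire format): serializing 100,000 rows one at a time repeats per-payload framing that a single batched payload pays only once.

```python
# Rough illustration of per-row vs. batched serialization overhead.
# This is NOT Spark's actual protocol -- just a stdlib analogy showing
# that one payload per row carries repeated framing costs that a single
# batched payload avoids.
import pickle

rows = [(i, float(i)) for i in range(100_000)]

per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)  # one payload per row
batched_bytes = len(pickle.dumps(rows))                  # one payload for all rows

print(per_row_bytes > batched_bytes)  # True: per-row framing adds up
```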
Pandas UDFs (Vectorized UDFs)
Pandas UDFs use Apache Arrow for efficient batch processing:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def normalize(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

# Use with the DataFrame API (withColumn returns a new DataFrame)
df = df.withColumn("normalized_amount", normalize("amount"))
```
Why Pandas UDFs Are Faster
- Data transferred in batches (columns of data) via Apache Arrow
- Arrow uses an in-memory columnar format — no row-by-row serialization
- Pandas operations are vectorized — NumPy/Pandas process entire columns at once
- Typically 10-100x faster than standard Python UDFs
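A practical side effect of this design: the body of a Pandas UDF is ordinary pandas code operating on whole Series, so it can be tested locally on a plain pd.Series without a Spark session. Note that pandas' Series.std() defaults to the sample standard deviation (ddof=1):

```python
# The body of the pandas UDF above, tested locally on a plain pd.Series.
# Series.std() uses the sample standard deviation (ddof=1) by default.
import pandas as pd

def normalize(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

print(normalize(pd.Series([1.0, 2.0, 3.0])).tolist())  # [-1.0, 0.0, 1.0]
```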
UDF Performance Comparison
| UDF Type | Performance | Best For |
|---|---|---|
| SQL UDF | Best (native execution) | Simple logic, math, string operations |
| Pandas UDF | Good (Arrow + vectorized) | Complex Python logic, statistical operations |
| Python UDF | Slowest (row-by-row serialization) | When Pandas/SQL cannot express the logic |
On the Exam: Know that SQL UDFs have the best performance because they run natively in the query engine. Pandas UDFs are the recommended approach when Python is needed. Standard Python UDFs should be avoided for large datasets due to serialization overhead.
Review Questions
- Why are standard Python UDFs slower than Pandas UDFs in Spark?
- How do you make a Python function callable from a Spark SQL query?