PySpark: How to Add a Column from a List of Values Using a UDF
When working with PySpark DataFrames, you may encounter situations where you have a list of values that needs to be added as a new column to an existing DataFrame. Unlike Pandas, PySpark does not allow direct column assignment from a Python list because DataFrames are distributed across a cluster. However, you can achieve this by combining row numbering with a User Defined Function (UDF).
In this guide, you will learn how to map a Python list to a PySpark DataFrame column using monotonically_increasing_id, row_number, and a UDF, with clear examples and explanations.
Understanding the Approach
The strategy involves three steps:
- Assign sequential row numbers to the DataFrame using row_number() with monotonically_increasing_id() (or another ordering expression) as the ordering column.
- Create a UDF that takes each row number as input and returns the corresponding value from the Python list.
- Apply the UDF to create the new column, then drop the temporary row number column.
This approach works reliably when the number of elements in the list matches the number of rows in the DataFrame. If the list is shorter and the UDF does not guard the lookup, you will get an IndexError at runtime.
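One way to catch a mismatch before any Spark job runs is an explicit length check. A minimal sketch, assuming a hypothetical helper named check_lengths; in practice the row count would come from df.count(), which triggers a job:

```python
def check_lengths(row_count, values):
    # Fail fast instead of hitting IndexError inside the UDF later
    if row_count != len(values):
        raise ValueError(
            f"DataFrame has {row_count} rows but the list has {len(values)} elements"
        )

# In practice: check_lengths(df.count(), student_names)
check_lengths(3, ["Anna", "David", "Marty"])  # matching lengths: no error
```

Calling this once up front turns a confusing executor-side IndexError into a clear driver-side ValueError.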
Adding a String Column from a List
In this example, we create a DataFrame with student data and add a Names column from a Python list.
Step-by-Step Code
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
# Step 1: Create a Spark session
spark = SparkSession.builder.getOrCreate()
# Step 2: Create a sample DataFrame
df = spark.createDataFrame(
[(1, 10000, 400),
(2, 14000, 500),
(3, 12000, 800)],
["Roll_Number", "Fees", "Fine"]
)
print("Original DataFrame:")
df.show()
Output:
Original DataFrame:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 1|10000| 400|
| 2|14000| 500|
| 3|12000| 800|
+-----------+-----+----+
Now add the list as a new column:
# Step 3: Define the list of values to add
student_names = ["Anna", "David", "Marty"]
# Step 4: Create a UDF that maps row index to list value safely
labels_udf = F.udf(
lambda idx: student_names[idx - 1] if idx and idx <= len(student_names) else None,
StringType()
)
# Step 5: Add sequential row numbers (ordering by a constant is arbitrary
# but gives each row a unique, consecutive row_id)
window_spec = Window.orderBy(F.lit(1))
df_with_id = df.withColumn(
"row_id",
row_number().over(window_spec)
)
# Step 6: Apply the UDF to create the new column
result_df = df_with_id.withColumn("Names", labels_udf("row_id"))
# Step 7: Drop the temporary row_id column and display
result_df.drop("row_id").show()
Output:
+-----------+-----+----+-----+
|Roll_Number| Fees|Fine|Names|
+-----------+-----+----+-----+
| 1|10000| 400| Anna|
| 2|14000| 500|David|
| 3|12000| 800|Marty|
+-----------+-----+----+-----+
How It Works
- Window.orderBy(F.lit(1)) imposes an arbitrary but consistent ordering, and row_number() over that window assigns consecutive integers starting from 1. Ordering by monotonically_increasing_id(), which generates unique, monotonically increasing IDs, works the same way. Note that a global window like this pulls all rows into a single partition, which is acceptable for small DataFrames.
- The UDF labels_udf receives each row number and returns student_names[idx - 1] (subtracting 1 because Python list indexing starts at 0).
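The index arithmetic can be checked in plain Python, since the UDF body is an ordinary lambda. A small sketch mirroring it, using a hypothetical function name row_to_name:

```python
student_names = ["Anna", "David", "Marty"]

def row_to_name(idx):
    # row_number() starts at 1 while Python lists start at 0, hence idx - 1;
    # missing or out-of-range row numbers map to None, like the guarded UDF
    if idx and idx <= len(student_names):
        return student_names[idx - 1]
    return None

print(row_to_name(1))  # Anna
print(row_to_name(4))  # None (out of range)
```

The idx truthiness check also short-circuits a None input before the comparison, which would otherwise raise a TypeError.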
Adding an Integer Column from a List
The same technique works for numeric data. Simply change the UDF's return type to IntegerType().
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
spark = SparkSession.builder.getOrCreate()
# Sample DataFrame with US names
df = spark.createDataFrame(
[(1, "James", 10000),
(2, "Emma", 14000),
(3, "Liam", 12000),
(4, "Olivia", 11000),
(5, "Noah", 13000)],
["Roll_Number", "Name", "Fees"]
)
# List of fine values
fine_data = [200, 300, 400, 0, 500]
# UDF
fine_udf = F.udf(
lambda idx: fine_data[idx - 1] if idx and idx <= len(fine_data) else None,
IntegerType()
)
# Deterministic ordering
window_spec = Window.orderBy("Roll_Number")
df_with_id = df.withColumn(
"row_id",
row_number().over(window_spec)
)
result_df = df_with_id.withColumn("Fine", fine_udf("row_id"))
result_df.drop("row_id").show()
Output:
+-----------+-------+-----+----+
|Roll_Number| Name| Fees|Fine|
+-----------+-------+-----+----+
| 1| James|10000| 200|
| 2| Emma|14000| 300|
| 3| Liam|12000| 400|
| 4| Olivia|11000| 0|
| 5| Noah|13000| 500|
+-----------+-------+-----+----+
Common Mistake: Mismatched List and DataFrame Sizes
If your list has fewer elements than the DataFrame has rows, the UDF will throw an IndexError.
Wrong: list has 3 elements but DataFrame has 5 rows
fine_data = [200, 300, 400] # Only 3 values for 5 rows!
fine_udf = F.udf(lambda idx: fine_data[idx - 1], IntegerType())
df_with_id = df.withColumn(
"row_id",
row_number().over(Window.orderBy("Roll_Number"))
)
# This fails when rows 4 and 5 are evaluated (e.g., by show())
result_df = df_with_id.withColumn("Fine", fine_udf("row_id"))
result_df.show()
Error
PythonException: IndexError: list index out of range
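The exception originates in the plain-Python lambda rather than in Spark itself, so the same lookup fails in ordinary Python. A minimal sketch, with fine_for_row standing in for the UDF body:

```python
fine_data = [200, 300, 400]

def fine_for_row(idx):
    # Mirrors the unguarded UDF body: 1-based row number -> 0-based list index
    return fine_data[idx - 1]

print(fine_for_row(3))  # 400, the last valid row
try:
    fine_for_row(4)     # no fourth element in the list
except IndexError as err:
    print("IndexError:", err)
```

Note a related quirk: an index of 0 would silently return the last element via Python's negative indexing instead of failing, which is another reason to guard the lookup rather than rely on the error.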
Correct: add validation in the UDF
fine_data = [200, 300, 400]
# Return None for out-of-range indices
safe_udf = F.udf(
lambda idx: fine_data[idx - 1] if idx and idx <= len(fine_data) else None,
IntegerType()
)
result_df = df_with_id.withColumn("Fine", safe_udf("row_id"))
result_df.drop("row_id").show()
Output:
+-----------+-------+-----+----+
|Roll_Number| Name| Fees|Fine|
+-----------+-------+-----+----+
| 1| James|10000| 200|
| 2| Emma|14000| 300|
| 3| Liam|12000| 400|
| 4| Olivia|11000|null|
| 5| Noah|13000|null|
+-----------+-------+-----+----+
Alternative Approach: Using createDataFrame and Join
For better performance - especially with large DataFrames - consider converting the list into a DataFrame and joining instead of using a UDF:
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
spark = SparkSession.builder.getOrCreate()
# Original DataFrame
df = spark.createDataFrame(
[(1, 10000), (2, 14000), (3, 12000)],
["Roll_Number", "Fees"]
)
# Add row IDs
df = df.withColumn(
"row_id",
row_number().over(Window.orderBy(monotonically_increasing_id()))
)
# Create a DataFrame from the list with matching row IDs
names = ["Aman", "Ishita", "Vinayak"]
names_df = spark.createDataFrame(
[(i + 1, name) for i, name in enumerate(names)],
["row_id", "Name"]
)
# Join on row_id and drop the temporary column
result_df = df.join(names_df, on="row_id").drop("row_id")
result_df.show()
Output:
+-----------+-----+-------+
|Roll_Number| Fees|   Name|
+-----------+-----+-------+
|          1|10000|   Aman|
|          2|14000| Ishita|
|          3|12000|Vinayak|
+-----------+-----+-------+
The join-based approach is generally preferred in production because:
- UDFs run in Python and cannot be optimized by Spark's Catalyst optimizer.
- Joins execute natively in the JVM and benefit from Spark's built-in optimizations.
- It avoids serialization overhead between the JVM and the Python worker processes.
Summary
| Approach | Best For | Performance |
|---|---|---|
| UDF with row_number | Quick prototyping, small DataFrames | Slower (Python UDF overhead) |
| Join with list-based DataFrame | Production workloads, large DataFrames | Faster (native Spark execution) |
Key steps to add a column from a list using a UDF:
- Generate row numbers using row_number() over a window ordered by monotonically_increasing_id() (or another ordering expression).
- Define a UDF that maps each row number to the corresponding list element.
- Apply the UDF to create the new column and drop the temporary row number column.
- Always validate that the list length matches the DataFrame row count to avoid IndexError.
For performance-critical applications, prefer the join-based approach over UDFs.