
PySpark: How to Add a Column from a List of Values Using a UDF

When working with PySpark DataFrames, you may encounter situations where you have a list of values that needs to be added as a new column to an existing DataFrame. Unlike Pandas, PySpark does not allow direct column assignment from a Python list because DataFrames are distributed across a cluster. However, you can achieve this by combining row numbering with a User Defined Function (UDF).

In this guide, you will learn how to map a Python list to a PySpark DataFrame column using monotonically_increasing_id, row_number, and a UDF, with clear examples and explanations.

Understanding the Approach

The strategy involves three steps:

  1. Assign sequential row numbers to the DataFrame using row_number() with monotonically_increasing_id() as the ordering column.
  2. Create a UDF that takes each row number as input and returns the corresponding value from the Python list.
  3. Apply the UDF to create the new column, then drop the temporary row number column.
Info: This approach works reliably only when the list has exactly as many elements as the DataFrame has rows. If the list is shorter, the UDF will raise an IndexError at runtime; if it is longer, the extra values are silently ignored.
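A cheap way to fail fast is to compare the list length against df.count() before building the UDF. The helper below is a plain-Python sketch of that guard (the name check_lengths is illustrative, not part of any PySpark API):

```python
def check_lengths(values, row_count):
    """Fail early on the driver, rather than inside the UDF on an executor."""
    if len(values) != row_count:
        raise ValueError(
            f"List has {len(values)} elements but the DataFrame "
            f"has {row_count} rows"
        )

check_lengths([200, 300, 400], 3)  # passes silently
# In PySpark you would call it as: check_lengths(my_list, df.count())
```

Note that df.count() triggers a job, so reserve this check for cases where the list length is genuinely uncertain.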

Adding a String Column from a List

In this example, we create a DataFrame with student data and add a Names column from a Python list.

Step-by-Step Code

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window

# Step 1: Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Step 2: Create a sample DataFrame
df = spark.createDataFrame(
    [(1, 10000, 400),
     (2, 14000, 500),
     (3, 12000, 800)],
    ["Roll_Number", "Fees", "Fine"],
)

print("Original DataFrame:")
df.show()

Output:

Original DataFrame:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
|          1|10000| 400|
|          2|14000| 500|
|          3|12000| 800|
+-----------+-----+----+

Now add the list as a new column:

# Step 3: Define the list of values to add
student_names = ["Anna", "David", "Marty"]

# Step 4: Create a UDF that maps row index to list value safely
labels_udf = F.udf(
    lambda idx: student_names[idx - 1] if idx and idx <= len(student_names) else None,
    StringType(),
)

# Step 5: Add sequential row numbers, ordered by monotonically increasing IDs
window_spec = Window.orderBy(monotonically_increasing_id())
df_with_id = df.withColumn(
    "row_id",
    row_number().over(window_spec),
)

# Step 6: Apply the UDF to create the new column
result_df = df_with_id.withColumn("Names", labels_udf("row_id"))

# Step 7: Drop the temporary row_id column and display
result_df.drop("row_id").show()

Output:

+-----------+-----+----+-----+
|Roll_Number| Fees|Fine|Names|
+-----------+-----+----+-----+
|          1|10000| 400| Anna|
|          2|14000| 500|David|
|          3|12000| 800|Marty|
+-----------+-----+----+-----+

How It Works

  1. monotonically_increasing_id() generates unique, monotonically increasing 64-bit IDs, but they are not guaranteed to be consecutive across partitions.
  2. row_number().over(Window.orderBy(...)) converts those IDs into consecutive integers starting from 1.
  3. The UDF labels_udf receives each row number and returns student_names[idx - 1] (subtracting 1 because list indexing starts at 0).
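The index arithmetic inside the UDF is ordinary Python, so it can be verified without a Spark session. Below is a minimal sketch of the same logic as a named function (map_row_to_name is a hypothetical helper mirroring the lambda above):

```python
student_names = ["Anna", "David", "Marty"]

def map_row_to_name(idx, values):
    # row_number() starts at 1; Python list indexing starts at 0
    if idx is not None and 1 <= idx <= len(values):
        return values[idx - 1]
    return None  # out-of-range rows become null in the DataFrame

print(map_row_to_name(1, student_names))  # Anna
print(map_row_to_name(3, student_names))  # Marty
print(map_row_to_name(4, student_names))  # None
```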

Adding an Integer Column from a List

The same technique works for numeric data. Simply change the UDF's return type to IntegerType().

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with US names
df = spark.createDataFrame(
    [(1, "James", 10000),
     (2, "Emma", 14000),
     (3, "Liam", 12000),
     (4, "Olivia", 11000),
     (5, "Noah", 13000)],
    ["Roll_Number", "Name", "Fees"],
)

# List of fine values
fine_data = [200, 300, 400, 0, 500]

# UDF
fine_udf = F.udf(
    lambda idx: fine_data[idx - 1] if idx and idx <= len(fine_data) else None,
    IntegerType(),
)

# Deterministic ordering
window_spec = Window.orderBy("Roll_Number")

df_with_id = df.withColumn(
    "row_id",
    row_number().over(window_spec),
)

result_df = df_with_id.withColumn("Fine", fine_udf("row_id"))

result_df.drop("row_id").show()

Output:

+-----------+------+-----+----+
|Roll_Number|  Name| Fees|Fine|
+-----------+------+-----+----+
|          1| James|10000| 200|
|          2|  Emma|14000| 300|
|          3|  Liam|12000| 400|
|          4|Olivia|11000|   0|
|          5|  Noah|13000| 500|
+-----------+------+-----+----+

Common Mistake: Mismatched List and DataFrame Sizes

If your list has fewer elements than the DataFrame has rows, the UDF will throw an IndexError.

Wrong: list has 3 elements but DataFrame has 5 rows

fine_data = [200, 300, 400]  # Only 3 values for 5 rows!

fine_udf = F.udf(lambda idx: fine_data[idx - 1], IntegerType())

df_with_id = df.withColumn(
    "row_id",
    row_number().over(Window.orderBy(monotonically_increasing_id())),
)

# This will fail when processing rows 4 and 5
result_df = df_with_id.withColumn("Fine", fine_udf("row_id"))
result_df.show()

Error

PythonException: IndexError: list index out of range

Correct: add validation in the UDF

fine_data = [200, 300, 400]

# Return None for out-of-range indices
safe_udf = F.udf(
    lambda idx: fine_data[idx - 1] if idx <= len(fine_data) else None,
    IntegerType(),
)

result_df = df_with_id.withColumn("Fine", safe_udf("row_id"))
result_df.drop("row_id").show()

Output:

+-----------+------+-----+----+
|Roll_Number|  Name| Fees|Fine|
+-----------+------+-----+----+
|          1| James|10000| 200|
|          2|  Emma|14000| 300|
|          3|  Liam|12000| 400|
|          4|Olivia|11000|null|
|          5|  Noah|13000|null|
+-----------+------+-----+----+

Alternative Approach: Using createDataFrame and Join

For better performance, especially with large DataFrames, consider converting the list into a DataFrame and joining, instead of using a UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Original DataFrame
df = spark.createDataFrame(
    [(1, 10000), (2, 14000), (3, 12000)],
    ["Roll_Number", "Fees"],
)

# Add row IDs
df = df.withColumn(
    "row_id",
    row_number().over(Window.orderBy(monotonically_increasing_id())),
)

# Create a DataFrame from the list with matching row IDs
names = ["Aman", "Ishita", "Vinayak"]
names_df = spark.createDataFrame(
    [(i + 1, name) for i, name in enumerate(names)],
    ["row_id", "Name"],
)

# Join on row_id and drop the temporary column
result_df = df.join(names_df, on="row_id").drop("row_id")
result_df.show()

Output:

+-----------+-----+-------+
|Roll_Number| Fees|   Name|
+-----------+-----+-------+
|          1|10000|   Aman|
|          2|14000| Ishita|
|          3|12000|Vinayak|
+-----------+-----+-------+
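The only list manipulation in this approach, pairing each name with a 1-based row ID via enumerate, is plain Python and easy to check in isolation:

```python
names = ["Aman", "Ishita", "Vinayak"]

# Same comprehension that feeds spark.createDataFrame above
rows = [(i + 1, name) for i, name in enumerate(names)]
print(rows)  # [(1, 'Aman'), (2, 'Ishita'), (3, 'Vinayak')]
```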
Tip: The join-based approach is generally preferred in production because:

  • UDFs run in Python and cannot be optimized by Spark's Catalyst optimizer.
  • Joins are executed natively in JVM and benefit from Spark's built-in optimizations.
  • It avoids serialization overhead between the JVM and Python processes.

Summary

Approach                       | Best For                               | Performance
UDF with row_number            | Quick prototyping, small DataFrames    | Slower (Python UDF overhead)
Join with list-based DataFrame | Production workloads, large DataFrames | Faster (native Spark execution)

Key steps to add a column from a list using a UDF:

  1. Generate row numbers using row_number() over monotonically_increasing_id().
  2. Define a UDF that maps each row number to the corresponding list element.
  3. Apply the UDF to create the new column and drop the temporary row number column.
  4. Always validate that the list length matches the DataFrame row count to avoid IndexError.

For performance-critical applications, prefer the join-based approach over UDFs.