
How to Update Nested Columns in a PySpark DataFrame

PySpark DataFrames support nested columns - columns that contain structured sub-fields defined using StructType. These are common when working with JSON data, complex schemas, or hierarchical information like addresses, names, or dates. While creating nested columns is straightforward, updating individual sub-fields within a nested column requires specific techniques.

In this guide, you will learn what nested columns are, how to create them, and how to update their values - both unconditionally and based on conditions - using PySpark's withField(), col(), lit(), and when() functions.

What Are Nested Columns?

A nested column (also called a struct column) is a column that contains multiple sub-fields. In PySpark, these are defined using StructType with individual sub-fields specified as StructField entries.

Real-world examples of nested columns:

| Nested Column | Sub-fields |
| --- | --- |
| full_name | first_name, middle_name, last_name |
| address | street, city, state, zip_code |
| date_of_birth | year, month, day |

You access sub-fields using dot notation: col("full_name.first_name").

Setting Up the Example DataFrame

Let's create a DataFrame with a nested Date_Of_Birth column containing Year, Month, and Date sub-fields:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# Define the dataset
data = [
    ((2000, 21, 2), 18),
    ((1998, 14, 6), 24),
    ((1998, 1, 11), 18),
    ((2006, 30, 3), 16),
]

# Define the schema with a nested StructType
schema = StructType([
    StructField("Date_Of_Birth", StructType([
        StructField("Year", IntegerType(), True),
        StructField("Month", IntegerType(), True),
        StructField("Date", IntegerType(), True),
    ]), True),
    StructField("Age", IntegerType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.show()
df.printSchema()

Output:

+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {2000, 21, 2}| 18|
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
| {2006, 30, 3}| 16|
+--------------+---+

root
 |-- Date_Of_Birth: struct (nullable = true)
 |    |-- Year: integer (nullable = true)
 |    |-- Month: integer (nullable = true)
 |    |-- Date: integer (nullable = true)
 |-- Age: integer (nullable = true)

Updating a Nested Column Unconditionally

To replace the value of a nested sub-field for all rows, use withColumn() combined with col().withField() and lit():

from pyspark.sql.functions import col, lit

# Set the "Year" sub-field to 2024 for all rows
updated_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField("Year", lit(2024))
)

updated_df.show()

Output:

+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {2024, 21, 2}| 18|
| {2024, 14, 6}| 24|
| {2024, 1, 11}| 18|
| {2024, 30, 3}| 16|
+--------------+---+

Note: The withField() method was introduced in Spark 3.1. If you are using an older version, you will need to reconstruct the entire struct manually using struct(). See the alternative approach section below.
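If your code may run against more than one Spark version, you can gate the withField() path on the version string that SparkSession exposes as spark.version. Below is a minimal sketch; supports_with_field is an illustrative helper name, not a PySpark API:

```python
def supports_with_field(spark_version: str) -> bool:
    """Return True when a Spark version string such as "3.5.1" is at least
    3.1, the first release that ships Column.withField()."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 1)
```

At runtime you would call supports_with_field(spark.version) and fall back to rebuilding the struct with struct() when it returns False.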

Updating a Nested Column Conditionally

In most real-world scenarios, you want to update a nested sub-field only when a specific condition is met. Use the when().otherwise() construct to apply conditional logic.

Example 1: Update Based on the Nested Column's Own Value

Update the Date sub-field: if its current value is 2, replace it with 24; otherwise, keep the original value.

from pyspark.sql.functions import col, lit, when

updated_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField(
        "Date",
        when(col("Date_Of_Birth.Date") == 2, lit(24))
        .otherwise(col("Date_Of_Birth.Date"))
    )
)

updated_df.show()

Output:

+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
|{2000, 21, 24}| 18|
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
| {2006, 30, 3}| 16|
+--------------+---+

Only the first row's Date changed from 2 to 24. All other rows retained their original values.

Example 2: Update Based on Another Column's Value

Update the Year sub-field: if the Age column equals 18, set Year to 2004; otherwise, keep the original value.

from pyspark.sql.functions import col, lit, when

updated_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField(
        "Year",
        when(col("Age") == 18, lit(2004))
        .otherwise(col("Date_Of_Birth.Year"))
    )
)

updated_df.show()

Output:

+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 21, 2}| 18|
|{1998, 14, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 30, 3}| 16|
+-------------+---+

Rows 1 and 3 (where Age == 18) had their Year updated to 2004.

Example 3: Update Multiple Nested Sub-Fields

You can chain multiple withField() calls to update several sub-fields at once:

from pyspark.sql.functions import col, lit, when

updated_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth")
    .withField(
        "Year",
        when(col("Age") == 18, lit(2004))
        .otherwise(col("Date_Of_Birth.Year"))
    )
    .withField(
        "Month",
        when(col("Date_Of_Birth.Month") > 12, lit(12))
        .otherwise(col("Date_Of_Birth.Month"))
    )
)

updated_df.show()

Output:

+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 12, 2}| 18|
|{1998, 12, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 12, 3}| 16|
+-------------+---+

Both Year and Month were updated in a single transformation.

Common Mistake: Forgetting the otherwise() Clause

A frequent error when using conditional updates is omitting the otherwise() clause. Without it, rows that don't match the condition will have the sub-field set to null.

Wrong - missing otherwise():

# This sets Year to null for rows where Age != 18
wrong_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField(
        "Year",
        when(col("Age") == 18, lit(2004))
        # No otherwise(): non-matching rows get null!
    )
)

wrong_df.show()

Output:

+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {2004, 21, 2}| 18|
| {null, 14, 6}| 24|
| {2004, 1, 11}| 18|
| {null, 30, 3}| 16|
+--------------+---+

Correct - always include otherwise() to preserve existing values:

correct_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField(
        "Year",
        when(col("Age") == 18, lit(2004))
        .otherwise(col("Date_Of_Birth.Year"))  # Preserve original value
    )
)

Caution: Always include .otherwise(col("struct_column.sub_field")) when conditionally updating nested columns. Without it, non-matching rows will silently lose their data by being set to null.

Alternative for Spark Versions Before 3.1

If you are using a Spark version older than 3.1 where withField() is not available, you can reconstruct the entire struct using the struct() function:

from pyspark.sql.functions import col, lit, when, struct

updated_df = df.withColumn(
    "Date_Of_Birth",
    struct(
        when(col("Age") == 18, lit(2004))
        .otherwise(col("Date_Of_Birth.Year")).alias("Year"),
        col("Date_Of_Birth.Month").alias("Month"),
        col("Date_Of_Birth.Date").alias("Date"),
    )
)

updated_df.show()
updated_df.printSchema()

Output:

+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 21, 2}| 18|
|{1998, 14, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 30, 3}| 16|
+-------------+---+

root
 |-- Date_Of_Birth: struct (nullable = false)
 |    |-- Year: integer (nullable = true)
 |    |-- Month: integer (nullable = true)
 |    |-- Date: integer (nullable = true)
 |-- Age: integer (nullable = true)

Note: The struct() approach requires you to list all sub-fields, even those you are not modifying. This is more verbose but works across all Spark versions. The withField() approach (Spark 3.1+) is cleaner because you only specify the sub-field you want to change.

Reading Nested Column Values

To access or filter on nested column values, use dot notation with col():

# Select specific nested sub-fields
df.select(
    col("Date_Of_Birth.Year").alias("Birth_Year"),
    col("Date_Of_Birth.Month").alias("Birth_Month"),
    col("Age"),
).show()

# Filter rows based on a nested sub-field value
df.filter(col("Date_Of_Birth.Year") == 1998).show()

Output:

+----------+-----------+---+
|Birth_Year|Birth_Month|Age|
+----------+-----------+---+
|      2000|         21| 18|
|      1998|         14| 24|
|      1998|          1| 18|
|      2006|         30| 16|
+----------+-----------+---+

+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
+--------------+---+

Summary

Updating nested columns in PySpark requires a combination of withColumn(), col().withField(), and conditional logic. Here are the key takeaways:

| Task | Method |
| --- | --- |
| Update a sub-field for all rows | col("struct").withField("sub_field", lit(value)) |
| Update conditionally | when(condition, lit(new_value)).otherwise(col("struct.sub_field")) |
| Update multiple sub-fields | Chain multiple .withField() calls |
| Spark < 3.1 (no withField) | Reconstruct with struct(), listing all sub-fields |
| Access nested values | col("struct.sub_field") |
  • Always include .otherwise() to preserve original values for non-matching rows.
  • Use dot notation (col("Date_Of_Birth.Year")) to access nested sub-fields in conditions and selections.
  • Prefer withField() (Spark 3.1+) over struct() reconstruction for cleaner, more maintainable code.