PySpark: How to Update Nested Columns in a DataFrame
PySpark DataFrames support nested columns - columns that contain structured sub-fields defined using StructType. These are common when working with JSON data, complex schemas, or hierarchical information like addresses, names, or dates. While creating nested columns is straightforward, updating individual sub-fields within a nested column requires specific techniques.
In this guide, you will learn what nested columns are, how to create them, and how to update their values - both unconditionally and based on conditions - using PySpark's withField(), col(), lit(), and when() functions.
What Are Nested Columns?
A nested column (also called a struct column) is a column that contains multiple sub-fields. In PySpark, these are defined using StructType with individual sub-fields specified as StructField entries.
Real-world examples of nested columns:
| Nested Column | Sub-fields |
|---|---|
| full_name | first_name, middle_name, last_name |
| address | street, city, state, zip_code |
| date_of_birth | year, month, day |
You access sub-fields using dot notation: col("full_name.first_name").
Setting Up the Example DataFrame
Let's create a DataFrame with a nested Date_Of_Birth column containing Year, Month, and Date sub-fields:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
spark = SparkSession.builder.getOrCreate()
# Define the dataset
data = [
((2000, 21, 2), 18),
((1998, 14, 6), 24),
((1998, 1, 11), 18),
((2006, 30, 3), 16),
]
# Define the schema with a nested StructType
schema = StructType([
StructField("Date_Of_Birth", StructType([
StructField("Year", IntegerType(), True),
StructField("Month", IntegerType(), True),
StructField("Date", IntegerType(), True),
])),
StructField("Age", IntegerType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
df.printSchema()
Output:
+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {2000, 21, 2}| 18|
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
| {2006, 30, 3}| 16|
+--------------+---+
root
|-- Date_Of_Birth: struct (nullable = true)
| |-- Year: integer (nullable = true)
| |-- Month: integer (nullable = true)
| |-- Date: integer (nullable = true)
|-- Age: integer (nullable = true)
Updating a Nested Column Unconditionally
To replace the value of a nested sub-field for all rows, use withColumn() combined with col().withField() and lit():
from pyspark.sql.functions import col, lit
# Set the "Year" sub-field to 2024 for all rows
updated_df = df.withColumn(
"Date_Of_Birth",
col("Date_Of_Birth").withField("Year", lit(2024))
)
updated_df.show()
Output:
+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {2024, 21, 2}| 18|
| {2024, 14, 6}| 24|
| {2024, 1, 11}| 18|
| {2024, 30, 3}| 16|
+--------------+---+
The withField() method was introduced in Spark 3.1. If you are using an older version, you will need to reconstruct the entire struct manually using struct(). See the alternative approach section below.
Updating a Nested Column Conditionally
In many real-world scenarios, you want to update a nested sub-field only when a specific condition is met. Use the when().otherwise() construct to apply conditional logic.
Example 1: Update Based on the Nested Column's Own Value
Update the Date sub-field: if its current value is 2, replace it with 24; otherwise, keep the original value.
from pyspark.sql.functions import col, lit, when
updated_df = df.withColumn(
"Date_Of_Birth",
col("Date_Of_Birth").withField(
"Date",
when(col("Date_Of_Birth.Date") == 2, lit(24))
.otherwise(col("Date_Of_Birth.Date"))
)
)
updated_df.show()
Output:
+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
|{2000, 21, 24}| 18|
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
| {2006, 30, 3}| 16|
+--------------+---+
Only the first row's Date changed from 2 to 24. All other rows retained their original values.
Example 2: Update Based on Another Column's Value
Update the Year sub-field: if the Age column equals 18, set Year to 2004; otherwise, keep the original value.
from pyspark.sql.functions import col, lit, when
updated_df = df.withColumn(
"Date_Of_Birth",
col("Date_Of_Birth").withField(
"Year",
when(col("Age") == 18, lit(2004))
.otherwise(col("Date_Of_Birth.Year"))
)
)
updated_df.show()
Output:
+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 21, 2}| 18|
|{1998, 14, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 30, 3}| 16|
+-------------+---+
Rows 1 and 3 (where Age == 18) had their Year updated to 2004.
Example 3: Update Multiple Nested Sub-Fields
You can chain multiple withField() calls to update several sub-fields at once:
from pyspark.sql.functions import col, lit, when
updated_df = df.withColumn(
"Date_Of_Birth",
col("Date_Of_Birth")
.withField(
"Year",
when(col("Age") == 18, lit(2004))
.otherwise(col("Date_Of_Birth.Year"))
)
.withField(
"Month",
when(col("Date_Of_Birth.Month") > 12, lit(12))
.otherwise(col("Date_Of_Birth.Month"))
)
)
updated_df.show()
Output:
+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 12, 2}| 18|
|{1998, 12, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 12, 3}| 16|
+-------------+---+
Both Year and Month were updated in a single transformation.
Common Mistake: Forgetting the otherwise() Clause
A frequent error when using conditional updates is omitting the otherwise() clause. Without it, rows that don't match the condition will have the sub-field set to null.
Wrong - missing otherwise():
# This sets Year to null for rows where Age != 18
wrong_df = df.withColumn(
"Date_Of_Birth",
col("Date_Of_Birth").withField(
"Year",
when(col("Age") == 18, lit(2004))
# No otherwise(): non-matching rows get null!
)
)
wrong_df.show()
Output:
+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {2004, 21, 2}| 18|
| {null, 14, 6}| 24|
| {2004, 1, 11}| 18|
| {null, 30, 3}| 16|
+--------------+---+
Correct - always include otherwise() to preserve existing values:
correct_df = df.withColumn(
"Date_Of_Birth",
col("Date_Of_Birth").withField(
"Year",
when(col("Age") == 18, lit(2004))
.otherwise(col("Date_Of_Birth.Year")) # Preserve original value
)
)
Always include .otherwise(col("struct_column.sub_field")) when conditionally updating nested columns. Without it, non-matching rows will silently lose their data by being set to null.
Alternative for Spark Versions Before 3.1
If you are using a Spark version older than 3.1 where withField() is not available, you can reconstruct the entire struct using the struct() function:
from pyspark.sql.functions import col, lit, when, struct
updated_df = df.withColumn(
"Date_Of_Birth",
struct(
when(col("Age") == 18, lit(2004))
.otherwise(col("Date_Of_Birth.Year")).alias("Year"),
col("Date_Of_Birth.Month").alias("Month"),
col("Date_Of_Birth.Date").alias("Date"),
)
)
updated_df.show()
updated_df.printSchema()
Output:
+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{2004, 21, 2}| 18|
|{1998, 14, 6}| 24|
|{2004, 1, 11}| 18|
|{2006, 30, 3}| 16|
+-------------+---+
root
|-- Date_Of_Birth: struct (nullable = false)
| |-- Year: integer (nullable = true)
| |-- Month: integer (nullable = true)
| |-- Date: integer (nullable = true)
|-- Age: integer (nullable = true)
The struct() approach requires you to list all sub-fields, even those you are not modifying. This is more verbose but works across all Spark versions. The withField() approach (Spark 3.1+) is cleaner because you only specify the sub-field you want to change.
Reading Nested Column Values
To access or filter on nested column values, use dot notation with col():
# Select specific nested sub-fields
df.select(
col("Date_Of_Birth.Year").alias("Birth_Year"),
col("Date_Of_Birth.Month").alias("Birth_Month"),
col("Age"),
).show()
# Filter rows based on a nested sub-field value
df.filter(col("Date_Of_Birth.Year") == 1998).show()
Output:
+----------+-----------+---+
|Birth_Year|Birth_Month|Age|
+----------+-----------+---+
| 2000| 21| 18|
| 1998| 14| 24|
| 1998| 1| 18|
| 2006| 30| 16|
+----------+-----------+---+
+--------------+---+
| Date_Of_Birth|Age|
+--------------+---+
| {1998, 14, 6}| 24|
| {1998, 1, 11}| 18|
+--------------+---+
Summary
Updating nested columns in PySpark requires a combination of withColumn(), col().withField(), and conditional logic. Here are the key takeaways:
| Task | Method |
|---|---|
| Update a sub-field for all rows | col("struct").withField("sub_field", lit(value)) |
| Update conditionally | when(condition, lit(new_value)).otherwise(col("struct.sub_field")) |
| Update multiple sub-fields | Chain multiple .withField() calls |
| Spark < 3.1 (no withField) | Reconstruct with struct() listing all sub-fields |
| Access nested values | col("struct.sub_field") |
- Always include .otherwise() to preserve original values for non-matching rows.
- Use dot notation (col("Date_Of_Birth.Year")) to access nested sub-fields in conditions and selections.
- Prefer withField() (Spark 3.1+) over struct() reconstruction for cleaner, more maintainable code.