How to Add a Column to a Nested Struct in Python PySpark
When working with PySpark DataFrames that contain nested structures (structs), you may need to add a new field inside an existing struct column: for example, adding a middle name to a name struct, or adding a year field to a date struct. PySpark provides the .withField() method (available in Spark 3.1+) to accomplish this cleanly.
This guide walks through the process step by step, with practical examples.
Prerequisites
- Apache Spark 3.1+ (the .withField() method was introduced in Spark 3.1)
- PySpark installed (pip install pyspark)
- Basic understanding of PySpark DataFrames and StructType schemas
Understanding Nested Structs
A struct in PySpark is a complex column type that contains multiple sub-fields, similar to a nested dictionary or a row within a row. For example, a Full_Name struct might contain First_Name and Last_Name fields.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Age', StringType(), True)
])
Step-by-Step Guide
Step 1: Import Required Libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
Step 2: Create a Spark Session
spark = SparkSession.builder.appName("NestedStructExample").getOrCreate()
Step 3: Define Schema and Data
# Define the nested schema
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Age', IntegerType(), True),
    StructField('City', StringType(), True)
])
# Define the data
data = [
    (('Alice', 'Smith'), 28, 'New York'),
    (('Bob', 'Johnson'), 35, 'Chicago'),
    (('Charlie', 'Brown'), 22, 'Boston')
]
# Create the DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
df.printSchema()
Output:
+------------------+---+--------+
|Full_Name |Age|City |
+------------------+---+--------+
|{Alice, Smith} |28 |New York|
|{Bob, Johnson} |35 |Chicago |
|{Charlie, Brown} |22 |Boston |
+------------------+---+--------+
root
|-- Full_Name: struct (nullable = true)
| |-- First_Name: string (nullable = true)
| |-- Last_Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- City: string (nullable = true)
Step 4: Add a Column to the Nested Struct
Use .withColumn() combined with .withField() to add a new field inside the struct:
# Add a 'Middle_Name' field to the 'Full_Name' struct
updated_df = df.withColumn(
    "Full_Name",
    col("Full_Name").withField("Middle_Name", lit("N/A"))
)
updated_df.show(truncate=False)
updated_df.printSchema()
Output:
+-----------------------+---+--------+
|Full_Name |Age|City |
+-----------------------+---+--------+
|{Alice, Smith, N/A} |28 |New York|
|{Bob, Johnson, N/A} |35 |Chicago |
|{Charlie, Brown, N/A} |22 |Boston |
+-----------------------+---+--------+
root
|-- Full_Name: struct (nullable = true)
| |-- First_Name: string (nullable = true)
| |-- Last_Name: string (nullable = true)
| |-- Middle_Name: string (nullable = false)
|-- Age: integer (nullable = true)
|-- City: string (nullable = true)
The Middle_Name field has been added inside the Full_Name struct.
Adding a Conditional Column to a Nested Struct
You can use when().otherwise() to set values based on conditions:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Gender', StringType(), True),
    StructField('Age', IntegerType(), True)
])
data = [
    (('Vansh', 'Rai'), 'Male', 20),
    (('Ria', 'Kapoor'), 'Female', 22),
    (('Amit', 'Sharma'), 'Male', 25),
    (('Priya', 'Gupta'), 'Female', 19)
]
df = spark.createDataFrame(data=data, schema=schema)
# Add 'Middle_Name' based on the 'Gender' column
updated_df = df.withColumn(
    "Full_Name",
    col("Full_Name").withField(
        "Middle_Name",
        when(col("Gender") == "Male", lit("Singh"))
        .otherwise(lit("Kaur"))
    )
)
updated_df.show(truncate=False)
Output:
+------------------------+------+---+
|Full_Name |Gender|Age|
+------------------------+------+---+
|{Vansh, Rai, Singh} |Male |20 |
|{Ria, Kapoor, Kaur} |Female|22 |
|{Amit, Sharma, Singh} |Male |25 |
|{Priya, Gupta, Kaur} |Female|19 |
+------------------------+------+---+
Adding a Column Derived from Another Column
You can also derive the new nested field from existing columns:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import col, lit, when
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('Date_Of_Birth', StructType([
        StructField('Day', IntegerType(), True),
        StructField('Month', IntegerType(), True)
    ])),
    StructField('Age', IntegerType(), True)
])
data = [
    ((21, 2), 18),
    ((16, 4), 20),
    ((11, 1), 18),
    ((6, 3), 20)
]
df = spark.createDataFrame(data=data, schema=schema)
# Add 'Year' field to 'Date_Of_Birth' based on 'Age'
updated_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField(
        "Year",
        when(col("Age") == 18, lit(2006))
        .otherwise(lit(2004))
    )
)
updated_df.show(truncate=False)
updated_df.printSchema()
Output:
+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{21, 2, 2006}|18 |
|{16, 4, 2004}|20 |
|{11, 1, 2006}|18 |
|{6, 3, 2004} |20 |
+-------------+---+
root
|-- Date_Of_Birth: struct (nullable = true)
| |-- Day: integer (nullable = true)
| |-- Month: integer (nullable = true)
| |-- Year: integer (nullable = false)
|-- Age: integer (nullable = true)
Adding Multiple Fields to a Struct
Chain multiple .withField() calls to add several fields at once:
updated_df = df.withColumn(
    "Full_Name",
    col("Full_Name")
    .withField("Middle_Name", lit("N/A"))
    .withField("Title", lit("Mr."))
    .withField("Suffix", lit(""))
)
Alternative: Rebuilding the Struct (For Spark < 3.1)
If you're using a Spark version before 3.1 where .withField() is not available, you can rebuild the struct manually using the struct() function:
from pyspark.sql.functions import col, lit, struct
updated_df = df.withColumn(
    "Full_Name",
    struct(
        col("Full_Name.First_Name").alias("First_Name"),
        col("Full_Name.Last_Name").alias("Last_Name"),
        lit("N/A").alias("Middle_Name")
    )
)
This approach requires you to explicitly list all existing fields in the struct. If you miss any, they'll be dropped from the result. The .withField() method (Spark 3.1+) is much safer as it preserves all existing fields automatically.
Key Syntax Reference
# Add a constant value to a nested struct
df.withColumn("struct_col",
col("struct_col").withField("new_field", lit(value))
)
# Add a conditional value
df.withColumn("struct_col",
col("struct_col").withField("new_field",
when(condition, lit(value1)).otherwise(lit(value2))
)
)
# Add a value derived from another column
df.withColumn("struct_col",
col("struct_col").withField("new_field", col("other_column"))
)
# Drop a field from a struct (Spark 3.1+)
df.withColumn("struct_col",
col("struct_col").dropFields("field_to_remove")
)
Conclusion
Adding a column to a nested struct in PySpark is straightforward with the .withField() method (Spark 3.1+).
Use it with lit() for constant values, when().otherwise() for conditional logic, or col() to derive values from other columns. For older Spark versions, manually rebuild the struct using the struct() function.
This technique is essential when working with complex, nested data formats commonly found in JSON-based data pipelines and data lake architectures.