How to Add a Column to a Nested Struct in Python PySpark

When working with PySpark DataFrames that contain nested structures (structs), you may need to add a new field inside an existing struct column: for example, adding a middle name to a name struct, or a year field to a date struct. PySpark provides the .withField() method (available in Spark 3.1+) to accomplish this cleanly.

This guide walks through the process step by step, with practical examples.

Prerequisites

  • Apache Spark 3.1+ (the .withField() method was introduced in Spark 3.1)
  • PySpark installed (pip install pyspark)
  • Basic understanding of PySpark DataFrames and StructType schemas

Understanding Nested Structs

A struct in PySpark is a complex column type that contains multiple sub-fields, similar to a nested dictionary or a row within a row. For example, a Full_Name struct might contain First_Name and Last_Name fields.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Age', IntegerType(), True)
])

Step-by-Step Guide

Step 1: Import Required Libraries

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when

Step 2: Create a Spark Session

spark = SparkSession.builder.appName("NestedStructExample").getOrCreate()

Step 3: Define Schema and Data

# Define the nested schema
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Age', IntegerType(), True),
    StructField('City', StringType(), True)
])

# Define the data
data = [
    (('Alice', 'Smith'), 28, 'New York'),
    (('Bob', 'Johnson'), 35, 'Chicago'),
    (('Charlie', 'Brown'), 22, 'Boston')
]

# Create the DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
df.printSchema()

Output:

+------------------+---+--------+
|Full_Name |Age|City |
+------------------+---+--------+
|{Alice, Smith} |28 |New York|
|{Bob, Johnson} |35 |Chicago |
|{Charlie, Brown} |22 |Boston |
+------------------+---+--------+

root
|-- Full_Name: struct (nullable = true)
| |-- First_Name: string (nullable = true)
| |-- Last_Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- City: string (nullable = true)

Step 4: Add a Column to the Nested Struct

Use .withColumn() combined with .withField() to add a new field inside the struct:

# Add a 'Middle_Name' field to the 'Full_Name' struct
updated_df = df.withColumn(
    "Full_Name",
    col("Full_Name").withField("Middle_Name", lit("N/A"))
)

updated_df.show(truncate=False)
updated_df.printSchema()

Output:

+-----------------------+---+--------+
|Full_Name |Age|City |
+-----------------------+---+--------+
|{Alice, Smith, N/A} |28 |New York|
|{Bob, Johnson, N/A} |35 |Chicago |
|{Charlie, Brown, N/A} |22 |Boston |
+-----------------------+---+--------+

root
|-- Full_Name: struct (nullable = true)
| |-- First_Name: string (nullable = true)
| |-- Last_Name: string (nullable = true)
| |-- Middle_Name: string (nullable = false)
|-- Age: integer (nullable = true)
|-- City: string (nullable = true)

The Middle_Name field has been added inside the Full_Name struct.

Adding a Conditional Column to a Nested Struct

You can use when().otherwise() to set values based on conditions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Gender', StringType(), True),
    StructField('Age', IntegerType(), True)
])

data = [
    (('Vansh', 'Rai'), 'Male', 20),
    (('Ria', 'Kapoor'), 'Female', 22),
    (('Amit', 'Sharma'), 'Male', 25),
    (('Priya', 'Gupta'), 'Female', 19)
]

df = spark.createDataFrame(data=data, schema=schema)

# Add 'Middle_Name' based on the 'Gender' column
updated_df = df.withColumn(
    "Full_Name",
    col("Full_Name").withField(
        "Middle_Name",
        when(col("Gender") == "Male", lit("Singh"))
        .otherwise(lit("Kaur"))
    )
)

updated_df.show(truncate=False)

Output:

+------------------------+------+---+
|Full_Name |Gender|Age|
+------------------------+------+---+
|{Vansh, Rai, Singh} |Male |20 |
|{Ria, Kapoor, Kaur} |Female|22 |
|{Amit, Sharma, Singh} |Male |25 |
|{Priya, Gupta, Kaur} |Female|19 |
+------------------------+------+---+

Adding a Column Derived from Another Column

You can also derive the new nested field from existing columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('Date_Of_Birth', StructType([
        StructField('Day', IntegerType(), True),
        StructField('Month', IntegerType(), True)
    ])),
    StructField('Age', IntegerType(), True)
])

data = [
    ((21, 2), 18),
    ((16, 4), 20),
    ((11, 1), 18),
    ((6, 3), 20)
]

df = spark.createDataFrame(data=data, schema=schema)

# Add 'Year' field to 'Date_Of_Birth' based on 'Age'
updated_df = df.withColumn(
    "Date_Of_Birth",
    col("Date_Of_Birth").withField(
        "Year",
        when(col("Age") == 18, lit(2006))
        .otherwise(lit(2004))
    )
)

updated_df.show(truncate=False)
updated_df.printSchema()

Output:

+-------------+---+
|Date_Of_Birth|Age|
+-------------+---+
|{21, 2, 2006}|18 |
|{16, 4, 2004}|20 |
|{11, 1, 2006}|18 |
|{6, 3, 2004} |20 |
+-------------+---+

root
|-- Date_Of_Birth: struct (nullable = true)
| |-- Day: integer (nullable = true)
| |-- Month: integer (nullable = true)
| |-- Year: integer (nullable = false)
|-- Age: integer (nullable = true)

Adding Multiple Fields to a Struct

Chain multiple .withField() calls to add several fields at once:

updated_df = df.withColumn(
    "Full_Name",
    col("Full_Name")
    .withField("Middle_Name", lit("N/A"))
    .withField("Title", lit("Mr."))
    .withField("Suffix", lit(""))
)

Alternative: Rebuilding the Struct (For Spark < 3.1)

If you're using a Spark version before 3.1 where .withField() is not available, you can rebuild the struct manually using the struct() function:

from pyspark.sql.functions import col, lit, struct

updated_df = df.withColumn(
    "Full_Name",
    struct(
        col("Full_Name.First_Name").alias("First_Name"),
        col("Full_Name.Last_Name").alias("Last_Name"),
        lit("N/A").alias("Middle_Name")
    )
)
Warning: This approach requires you to explicitly list all existing fields in the struct. If you miss any, they'll be dropped from the result. The .withField() method (Spark 3.1+) is much safer as it preserves all existing fields automatically.

Key Syntax Reference

# Add a constant value to a nested struct
df.withColumn("struct_col",
    col("struct_col").withField("new_field", lit(value))
)

# Add a conditional value
df.withColumn("struct_col",
    col("struct_col").withField("new_field",
        when(condition, lit(value1)).otherwise(lit(value2))
    )
)

# Add a value derived from another column
df.withColumn("struct_col",
    col("struct_col").withField("new_field", col("other_column"))
)

# Drop a field from a struct (Spark 3.1+)
df.withColumn("struct_col",
    col("struct_col").dropFields("field_to_remove")
)

Conclusion

Adding a column to a nested struct in PySpark is straightforward with the .withField() method (Spark 3.1+).

Use it with lit() for constant values, when().otherwise() for conditional logic, or col() to derive values from other columns. For older Spark versions, manually rebuild the struct using the struct() function.

This technique is essential when working with complex, nested data formats commonly found in JSON-based data pipelines and data lake architectures.