How to Add a Column to a Nested Struct in Python PySpark
When working with PySpark DataFrames that contain nested structures (structs), you may need to add a new field inside an existing struct column, for example a middle name in a name struct or a year field in a date struct. PySpark provides the .withField() method on Column (available in Spark 3.1+) to accomplish this cleanly.
This guide walks through the process step by step, with practical examples.
Prerequisites
- Apache Spark 3.1+ (the .withField() method was introduced in Spark 3.1)
- PySpark installed (pip install pyspark)
- Basic understanding of PySpark DataFrames and StructType schemas
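One way to confirm the version prerequisite before going further is to read the installed package version. The sketch below is only illustrative: the helper name supports_with_field is hypothetical, and it assumes the version string starts with numeric major and minor parts (as in "3.5.1").

```python
from importlib.metadata import PackageNotFoundError, version


def supports_with_field(spark_version: str) -> bool:
    """Return True if the version string denotes Spark 3.1 or newer."""
    # Only the major and minor parts matter for this check
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 1)


try:
    installed = version("pyspark")
    print(f"PySpark {installed}: withField available = {supports_with_field(installed)}")
except PackageNotFoundError:
    print("PySpark is not installed; run: pip install pyspark")
```

If the check reports False, upgrading PySpark is the simplest fix, since .withField() does not exist on Column objects before 3.1.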
Understanding Nested Structs
A struct in PySpark is a complex column type that contains multiple sub-fields, similar to a nested dictionary or a row within a row. For example, a Full_Name struct might contain First_Name and Last_Name fields.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Age', StringType(), True)
])
Step-by-Step Guide
Step 1: Import Required Libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
Step 2: Create a Spark Session
spark = SparkSession.builder.appName("NestedStructExample").getOrCreate()
Step 3: Define Schema and Data
# Define the nested schema
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)
    ])),
    StructField('Age', IntegerType(), True),
    StructField('City', StringType(), True)
])
# Define the data
data = [
    (('Alice', 'Smith'), 28, 'New York'),
    (('Bob', 'Johnson'), 35, 'Chicago'),
    (('Charlie', 'Brown'), 22, 'Boston')
]
# Create the DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
df.printSchema()
Output:
+----------------+---+--------+
|Full_Name       |Age|City    |
+----------------+---+--------+
|{Alice, Smith}  |28 |New York|
|{Bob, Johnson}  |35 |Chicago |
|{Charlie, Brown}|22 |Boston  |
+----------------+---+--------+
root
 |-- Full_Name: struct (nullable = true)
 |    |-- First_Name: string (nullable = true)
 |    |-- Last_Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)