How to Merge Two PySpark DataFrames With Different Columns
When working with multiple PySpark DataFrames, you frequently need to combine them vertically (stacking rows). PySpark's union() and unionByName() operations require both DataFrames to have the same set of columns. If the DataFrames have different columns, the union either fails outright or silently misaligns data.
In this guide, you will learn how to handle this scenario by adding missing columns to each DataFrame before performing the union, using union(), unionAll(), and unionByName(), plus the allowMissingColumns shortcut available in PySpark 3.1+.
The Problem: Different Column Sets
Consider two DataFrames with different columns:
DataFrame 1 - has columns ID, NAME, Address:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('merge_demo').getOrCreate()
data1 = [
["1", "James", "New York"],
["2", "Emily", "Los Angeles"],
["3", "Michael", "Chicago"],
["4", "Sarah", "Houston"],
["5", "David", "Phoenix"]
]
dataframe1 = spark.createDataFrame(data1, ['ID', 'NAME', 'Address'])
dataframe1.show()
Output:
+---+-------+-----------+
| ID| NAME| Address|
+---+-------+-----------+
| 1| James| New York|
| 2| Emily|Los Angeles|
| 3|Michael| Chicago|
| 4| Sarah| Houston|
| 5| David| Phoenix|
+---+-------+-----------+
DataFrame 2 - has columns ID, Age:
data2 = [
["1", 23],
["2", 21],
["3", 32]
]
dataframe2 = spark.createDataFrame(data2, ['ID', 'Age'])
dataframe2.show()
Output:
+---+---+
| ID|Age|
+---+---+
| 1| 23|
| 2| 21|
| 3| 32|
+---+---+
Attempting to union these directly fails:
# ❌ This raises an error because the column counts differ
dataframe1.union(dataframe2).show()
AnalysisException: Union can only be performed on tables with
the same number of columns, but the first table has 3 columns
and the second table has 2 columns
The Solution: Add Missing Columns With lit(None)
Before performing the union, add the missing columns to each DataFrame using lit(None) to fill them with null values:
from pyspark.sql.functions import lit
# Add columns from dataframe2 that are missing in dataframe1
for column in dataframe2.columns:
    if column not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(column, lit(None))

# Add columns from dataframe1 that are missing in dataframe2
for column in dataframe1.columns:
    if column not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(column, lit(None))
print("DataFrame 1 columns:", dataframe1.columns)
print("DataFrame 2 columns:", dataframe2.columns)
Output:
DataFrame 1 columns: ['ID', 'NAME', 'Address', 'Age']
DataFrame 2 columns: ['ID', 'Age', 'NAME', 'Address']
Now both DataFrames have the same four columns. The missing values are filled with null.
Since this pattern is used frequently, wrap it in a helper function:
from pyspark.sql.functions import lit
def align_columns(df1, df2):
    """Add missing columns to both DataFrames so they share the same schema."""
    for col in df2.columns:
        if col not in df1.columns:
            df1 = df1.withColumn(col, lit(None))
    for col in df1.columns:
        if col not in df2.columns:
            df2 = df2.withColumn(col, lit(None))
    return df1, df2
Method 1: Using union()
union() merges two DataFrames by column position (not by name). After aligning the columns, call union() to stack the rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName('merge_demo').getOrCreate()
# Create DataFrame 1
data1 = [
["1", "James", "New York"],
["2", "Emily", "Los Angeles"],
["3", "Michael", "Chicago"],
["4", "Sarah", "Houston"],
["5", "David", "Phoenix"]
]
dataframe1 = spark.createDataFrame(data1, ['ID', 'NAME', 'Address'])
# Create DataFrame 2
data2 = [["1", 23], ["2", 21], ["3", 32]]
dataframe2 = spark.createDataFrame(data2, ['ID', 'Age'])
# Align columns
for col in dataframe2.columns:
    if col not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(col, lit(None))

for col in dataframe1.columns:
    if col not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(col, lit(None))

# union() matches by position, so reorder dataframe2 to match dataframe1 first
dataframe2 = dataframe2.select(dataframe1.columns)

# Perform union
dataframe1.union(dataframe2).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
Warning: union() matches columns by position, not by name. Because union() pairs columns by ordinal position, the column order in both DataFrames must be identical. After the alignment loops alone, dataframe1 has columns ['ID', 'NAME', 'Address', 'Age'] while dataframe2 has ['ID', 'Age', 'NAME', 'Address']; a purely positional union would pair Age with NAME and silently misalign the data.
Fix: Either reorder columns before the union or use unionByName() instead:
# Reorder dataframe2 columns to match dataframe1
dataframe2 = dataframe2.select(dataframe1.columns)
dataframe1.union(dataframe2).show()
Method 2: Using unionAll()
unionAll() is simply an alias for union(). It was the original name in Spark 1.x and was deprecated in Spark 2.0 in favor of union(); both behave identically, returning all rows including duplicates (unlike SQL's UNION, which deduplicates). Prefer union() in new code:
# Reorder columns to match, then union
dataframe2_aligned = dataframe2.select(dataframe1.columns)
dataframe1.unionAll(dataframe2_aligned).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
Method 3: Using unionByName() - The Safest Approach
unionByName() matches columns by name rather than position, making it the safest method when column orders differ between DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName('merge_demo').getOrCreate()
# Create DataFrames
dataframe1 = spark.createDataFrame(
[["1", "James", "New York"], ["2", "Emily", "Los Angeles"],
["3", "Michael", "Chicago"], ["4", "Sarah", "Houston"],
["5", "David", "Phoenix"]],
['ID', 'NAME', 'Address']
)
dataframe2 = spark.createDataFrame(
[["1", 23], ["2", 21], ["3", 32]],
['ID', 'Age']
)
# Align columns
for col in dataframe2.columns:
    if col not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(col, lit(None))

for col in dataframe1.columns:
    if col not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(col, lit(None))
# Perform unionByName: no need to worry about column order
dataframe1.unionByName(dataframe2).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
PySpark 3.1+: Built-In allowMissingColumns
Starting with PySpark 3.1, unionByName() supports an allowMissingColumns parameter that automatically handles missing columns - no manual alignment needed:
# PySpark 3.1+ only: automatically fills missing columns with null
dataframe1.unionByName(dataframe2, allowMissingColumns=True).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
If you are using PySpark 3.1 or later, unionByName(allowMissingColumns=True) is the simplest and most reliable approach - it eliminates the need for manual column alignment entirely.
Comparison of Methods
| Method | Column Matching | Handles Missing Columns | Column Order Matters? |
|---|---|---|---|
| union() | By position | ❌ Must align manually | ✅ Yes - order must match |
| unionAll() | By position | ❌ Must align manually | ✅ Yes - order must match |
| unionByName() | By name | ❌ Must align manually (pre-3.1) | ❌ No |
| unionByName(allowMissingColumns=True) | By name | ✅ Automatic (PySpark 3.1+) | ❌ No |
Complete Reusable Solution
Here is a complete, production-ready function that works across PySpark versions:
from pyspark.sql.functions import lit

def merge_dataframes(df1, df2):
    """
    Merge two PySpark DataFrames with potentially different columns.
    Missing columns are filled with null values.
    Uses unionByName for safe column matching.
    """
    # Add missing columns to df1
    for col in df2.columns:
        if col not in df1.columns:
            df1 = df1.withColumn(col, lit(None))
    # Add missing columns to df2
    for col in df1.columns:
        if col not in df2.columns:
            df2 = df2.withColumn(col, lit(None))
    return df1.unionByName(df2)
# Usage
result = merge_dataframes(dataframe1, dataframe2)
result.show()
Conclusion
Merging PySpark DataFrames with different columns requires adding the missing columns (filled with null) to each DataFrame before performing the union.
The safest approach is unionByName(), which matches columns by name regardless of order.
If you are using PySpark 3.1+, the built-in allowMissingColumns=True parameter handles everything automatically, making it the simplest and most reliable solution.
For older PySpark versions, manually aligning columns with lit(None) and using unionByName() is the recommended pattern.