How to Merge Two DataFrames with Different Columns in PySpark
When working with multiple PySpark DataFrames, you often need to merge (union) them into a single DataFrame. This is straightforward when both DataFrames share the same schema, but becomes challenging when they have different columns: union() requires the same number of columns, and unionByName() requires matching column names. Attempting to union DataFrames with mismatched schemas raises an AnalysisException.
In this guide, you will learn how to handle schema mismatches by adding missing columns to each DataFrame, and then merge them using union(), unionByName(), and the built-in allowMissingColumns parameter available in Spark 3.1+.
Setting Up the Example DataFrames
Let's create two DataFrames with different columns to demonstrate the merging process:
DataFrame 1: Employee Info (3 columns)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("merge_demo").getOrCreate()
data1 = [
["1", "Sravan", "Kakumanu"],
["2", "Ojaswi", "Hyderabad"],
["3", "Rohith", "Delhi"],
["4", "Sridevi", "Kakumanu"],
["5", "Bobby", "Guntur"],
]
dataframe1 = spark.createDataFrame(data1, ["ID", "Name", "Address"])
dataframe1.show()
Output:
+---+-------+---------+
| ID| Name| Address|
+---+-------+---------+
| 1| Sravan| Kakumanu|
| 2| Ojaswi|Hyderabad|
| 3| Rohith| Delhi|
| 4|Sridevi| Kakumanu|
| 5| Bobby| Guntur|
+---+-------+---------+
DataFrame 2: Employee Age (2 columns)
data2 = [
["1", 23],
["2", 21],
["3", 32],
]
dataframe2 = spark.createDataFrame(data2, ["ID", "Age"])
dataframe2.show()
Output:
+---+---+
| ID|Age|
+---+---+
| 1| 23|
| 2| 21|
| 3| 32|
+---+---+
The Problem: Direct Union Fails
If you try to union these DataFrames directly, PySpark raises an error because the column counts and schemas don't match:
# This will fail!
dataframe1.union(dataframe2).show()
Error:
AnalysisException: Union can only be performed on tables with the same number
of columns, but the first table has 3 columns and the second table has 2 columns
To resolve this, you need to add the missing columns to each DataFrame before merging.
Adding Missing Columns to Both DataFrames
The strategy is simple: for each column that exists in one DataFrame but not the other, add that column filled with null values using lit(None).
from pyspark.sql.functions import lit
# Add columns to dataframe1 that exist only in dataframe2
for column in dataframe2.columns:
    if column not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(column, lit(None))

# Add columns to dataframe2 that exist only in dataframe1
for column in dataframe1.columns:
    if column not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(column, lit(None))
print("DataFrame 1 columns:", dataframe1.columns)
print("DataFrame 2 columns:", dataframe2.columns)
Output:
DataFrame 1 columns: ['ID', 'Name', 'Address', 'Age']
DataFrame 2 columns: ['ID', 'Age', 'Name', 'Address']
Both DataFrames now have the same four columns, though potentially in different order. This distinction matters depending on which merge method you use.
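The resulting column orders follow directly from how withColumn works: each missing column is appended to the end, in loop order. A minimal plain-Python sketch of the same logic (no Spark needed) makes this visible:

```python
# Plain-Python sketch of the column-alignment loops above.
cols1 = ["ID", "Name", "Address"]   # dataframe1's columns
cols2 = ["ID", "Age"]               # dataframe2's columns

# withColumn appends each missing column to the end, so order follows the loop.
aligned1 = cols1 + [c for c in cols2 if c not in cols1]
aligned2 = cols2 + [c for c in cols1 if c not in cols2]

print(aligned1)  # ['ID', 'Name', 'Address', 'Age']
print(aligned2)  # ['ID', 'Age', 'Name', 'Address']
```

This is why the two aligned DataFrames end up with the same column set but different column orders.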
You can encapsulate this logic in a reusable function:
from pyspark.sql.functions import lit
def align_schemas(df1, df2):
    """Add missing columns (as null) to both DataFrames so they share the same schema."""
    for col_name in df2.columns:
        if col_name not in df1.columns:
            df1 = df1.withColumn(col_name, lit(None))
    for col_name in df1.columns:
        if col_name not in df2.columns:
            df2 = df2.withColumn(col_name, lit(None))
    return df1, df2
Method 1: Merging with union()
The union() method merges DataFrames by column position, not by column name. This means the first column of DataFrame 1 is matched with the first column of DataFrame 2, regardless of their names.
merged_df = dataframe1.union(dataframe2)
merged_df.show()
Output:
+---+-------+---------+----+
| ID| Name| Address| Age|
+---+-------+---------+----+
| 1| Sravan| Kakumanu|null|
| 2| Ojaswi|Hyderabad|null|
| 3| Rohith| Delhi|null|
| 4|Sridevi| Kakumanu|null|
| 5| Bobby| Guntur|null|
| 1| 23| null|null|
| 2| 21| null|null|
| 3| 32| null|null|
+---+-------+---------+----+
Notice the problem: because union() matches by position, the Age values (23, 21, 32) from DataFrame 2 ended up in the Name column instead of the Age column. This happens because DataFrame 2's column order is [ID, Age, Name, Address], while DataFrame 1's order is [ID, Name, Address, Age].
Always ensure both DataFrames have columns in the same order when using union(), or use unionByName() instead.
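To see exactly how the misplacement happens, here is a plain-Python sketch (no Spark required) of positional matching versus name-based matching, using one row from DataFrame 2:

```python
# One row of dataframe2 after alignment, in its own column order.
row2 = ("1", 23, None, None)
cols1 = ["ID", "Name", "Address", "Age"]   # dataframe1's order
cols2 = ["ID", "Age", "Name", "Address"]   # dataframe2's order

# union(): positional — dataframe2's values are read under dataframe1's header.
positional = dict(zip(cols1, row2))
print(positional["Name"])  # 23 — the Age value lands in the Name column

# unionByName(): by name — look each of dataframe1's columns up in dataframe2.
row2_by_name = dict(zip(cols2, row2))
by_name = {c: row2_by_name[c] for c in cols1}
print(by_name["Age"])  # 23 — correctly placed
```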
Fixing Column Order for union()
To use union() correctly, reorder DataFrame 2's columns to match DataFrame 1:
# Reorder dataframe2 columns to match dataframe1
dataframe2_reordered = dataframe2.select(dataframe1.columns)
merged_df = dataframe1.union(dataframe2_reordered)
merged_df.show()
Output:
+---+-------+---------+----+
| ID| Name| Address| Age|
+---+-------+---------+----+
| 1| Sravan| Kakumanu|null|
| 2| Ojaswi|Hyderabad|null|
| 3| Rohith| Delhi|null|
| 4|Sridevi| Kakumanu|null|
| 5| Bobby| Guntur|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+---------+----+
Now the values are correctly placed in their respective columns.
Method 2: Merging with unionByName()
The unionByName() method merges DataFrames by column name, so column order does not matter. This is the recommended approach when columns might be in different positions across DataFrames.
merged_df = dataframe1.unionByName(dataframe2)
merged_df.show()
Output:
+---+-------+---------+----+
| ID| Name| Address| Age|
+---+-------+---------+----+
| 1| Sravan| Kakumanu|null|
| 2| Ojaswi|Hyderabad|null|
| 3| Rohith| Delhi|null|
| 4|Sridevi| Kakumanu|null|
| 5| Bobby| Guntur|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+---------+----+
The values are correctly matched to their columns regardless of column order.
Method 3: Using unionByName() with allowMissingColumns (Spark 3.1+)
Starting with Spark 3.1, unionByName() supports an allowMissingColumns parameter that automatically fills missing columns with null - eliminating the need to manually add missing columns first.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("merge_demo").getOrCreate()
# Create DataFrames with different schemas (no manual column alignment needed)
data1 = [
["1", "Sravan", "Kakumanu"],
["2", "Ojaswi", "Hyderabad"],
["3", "Rohith", "Delhi"],
]
dataframe1 = spark.createDataFrame(data1, ["ID", "Name", "Address"])
data2 = [
["4", 23],
["5", 21],
]
dataframe2 = spark.createDataFrame(data2, ["ID", "Age"])
# Merge directly: no need to add missing columns manually
merged_df = dataframe1.unionByName(dataframe2, allowMissingColumns=True)
merged_df.show()
Output:
+---+------+---------+----+
| ID| Name| Address| Age|
+---+------+---------+----+
| 1|Sravan| Kakumanu|null|
| 2|Ojaswi|Hyderabad|null|
| 3|Rohith| Delhi|null|
| 4| null| null| 23|
| 5| null| null| 21|
+---+------+---------+----+
This is the cleanest and most concise approach. If you are on Spark 3.1 or later, prefer unionByName(allowMissingColumns=True) over manually adding missing columns. It handles everything automatically and is less error-prone.
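If your code may run on clusters with different Spark versions, you can gate on the runtime version before relying on allowMissingColumns. The helper below is a hypothetical sketch (the function name is mine, not a PySpark API); in practice you would pass it spark.version:

```python
def supports_allow_missing_columns(version):
    """True if the given Spark version string (e.g. spark.version) is 3.1 or later."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 1)

print(supports_allow_missing_columns("3.1.2"))  # True
print(supports_allow_missing_columns("2.4.8"))  # False
```

On older versions, fall back to the manual lit(None) alignment shown earlier.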
Method 4: Using unionAll()
unionAll() is functionally identical to union() in modern versions of PySpark: both merge by position, and neither removes duplicates (unlike the SQL UNION operator, which deduplicates). unionAll() was deprecated in Spark 2.0 in favor of union().
# Ensure columns are aligned before using unionAll()
dataframe2_reordered = dataframe2.select(dataframe1.columns)
merged_df = dataframe1.unionAll(dataframe2_reordered)
merged_df.show()
Since unionAll() behaves the same as union() in PySpark, prefer using union() or unionByName() for clarity. unionAll() is maintained for backward compatibility.
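Because neither method deduplicates, overlapping rows appear twice in the result; call distinct() on the merged DataFrame if you need set semantics. A plain-Python sketch of the behavior, with list concatenation standing in for union():

```python
# union()/unionAll() behave like list concatenation: duplicates survive.
rows_a = [("1", "Sravan"), ("2", "Ojaswi")]
rows_b = [("1", "Sravan")]           # overlaps with rows_a

merged = rows_a + rows_b
print(len(merged))  # 3 — the duplicate row is kept

# The PySpark analogue of deduplication is merged_df.distinct();
# here, dict.fromkeys drops repeats while preserving order.
deduped = list(dict.fromkeys(merged))
print(len(deduped))  # 2
```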
Comparison of Merge Methods
| Method | Matches By | Handles Different Column Orders | Auto-fills Missing Columns | Spark Version |
|---|---|---|---|---|
| union() | Position | ❌ No | ❌ No | All |
| unionAll() | Position | ❌ No | ❌ No | All |
| unionByName() | Column name | ✅ Yes | ❌ No | 2.3+ |
| unionByName(allowMissingColumns=True) | Column name | ✅ Yes | ✅ Yes | 3.1+ |
Merging More Than Two DataFrames
When merging multiple DataFrames, you can use functools.reduce for a clean approach:
from functools import reduce
from pyspark.sql import DataFrame
# dataframe3 stands in for any additional DataFrame you want to merge
dataframes = [dataframe1, dataframe2, dataframe3]
merged_df = reduce(
    lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True),
    dataframes,
)
merged_df.show()
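The fold performed by reduce can be illustrated on plain column lists: each step appends only columns not yet seen, mirroring how allowMissingColumns grows the merged schema. The "Salary" column here is an invented example, not from the DataFrames above:

```python
from functools import reduce

# Hypothetical column lists for three DataFrames (Salary is invented for illustration).
schemas = [["ID", "Name", "Address"], ["ID", "Age"], ["ID", "Salary"]]

def merge_cols(acc, cols):
    """Fold step: append columns not yet seen, mirroring allowMissingColumns."""
    return acc + [c for c in cols if c not in acc]

print(reduce(merge_cols, schemas))  # ['ID', 'Name', 'Address', 'Age', 'Salary']
```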
Summary
Merging PySpark DataFrames with different schemas requires handling missing columns. Here are the key takeaways:
- Spark 3.1+: Use unionByName(allowMissingColumns=True), the simplest and safest approach; it handles everything automatically.
- Spark < 3.1: Manually add missing columns with lit(None) to both DataFrames before merging.
- Use unionByName() instead of union() to avoid column-ordering bugs.
- Avoid union() unless you are certain both DataFrames have columns in the exact same order.
- Use reduce() with unionByName() when merging more than two DataFrames.