How to Merge Two DataFrames with Different Columns in PySpark

When working with multiple PySpark DataFrames, you often need to merge (union) them into a single DataFrame. This is straightforward when both DataFrames share the same schema, but becomes challenging when they have different columns. PySpark's union() requires both DataFrames to have the same number of columns, so attempting to union DataFrames with mismatched schemas raises an AnalysisException.

In this guide, you will learn how to handle schema mismatches by adding missing columns to each DataFrame, and then merge them using union(), unionByName(), and the built-in allowMissingColumns parameter available in Spark 3.1+.

Setting Up the Example DataFrames

Let's create two DataFrames with different columns to demonstrate the merging process:

DataFrame 1: Employee Info (3 columns)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_demo").getOrCreate()

data1 = [
    ["1", "Sravan", "Kakumanu"],
    ["2", "Ojaswi", "Hyderabad"],
    ["3", "Rohith", "Delhi"],
    ["4", "Sridevi", "Kakumanu"],
    ["5", "Bobby", "Guntur"],
]

dataframe1 = spark.createDataFrame(data1, ["ID", "Name", "Address"])
dataframe1.show()

Output:

+---+-------+---------+
| ID| Name| Address|
+---+-------+---------+
| 1| Sravan| Kakumanu|
| 2| Ojaswi|Hyderabad|
| 3| Rohith| Delhi|
| 4|Sridevi| Kakumanu|
| 5| Bobby| Guntur|
+---+-------+---------+

DataFrame 2: Employee Age (2 columns)

data2 = [
    ["1", 23],
    ["2", 21],
    ["3", 32],
]

dataframe2 = spark.createDataFrame(data2, ["ID", "Age"])
dataframe2.show()

Output:

+---+---+
| ID|Age|
+---+---+
| 1| 23|
| 2| 21|
| 3| 32|
+---+---+

The Problem: Direct Union Fails

If you try to union these DataFrames directly, PySpark raises an error because the column counts and schemas don't match:

# This will fail!
dataframe1.union(dataframe2).show()

Error:

AnalysisException: Union can only be performed on tables with the same number
of columns, but the first table has 3 columns and the second table has 2 columns

To resolve this, you need to add the missing columns to each DataFrame before merging.

Adding Missing Columns to Both DataFrames

The strategy is simple: for each column that exists in one DataFrame but not the other, add that column filled with null values using lit(None).

from pyspark.sql.functions import lit

# Add columns to dataframe1 that exist only in dataframe2
for column in dataframe2.columns:
    if column not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(column, lit(None))

# Add columns to dataframe2 that exist only in dataframe1
for column in dataframe1.columns:
    if column not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(column, lit(None))

print("DataFrame 1 columns:", dataframe1.columns)
print("DataFrame 2 columns:", dataframe2.columns)

Output:

DataFrame 1 columns: ['ID', 'Name', 'Address', 'Age']
DataFrame 2 columns: ['ID', 'Age', 'Name', 'Address']

Both DataFrames now have the same four columns, though potentially in different order. This distinction matters depending on which merge method you use.

Tip:

You can encapsulate this logic in a reusable function:

from pyspark.sql.functions import lit

def align_schemas(df1, df2):
    """Add missing columns (as null) to both DataFrames so they share the same schema."""
    for col_name in df2.columns:
        if col_name not in df1.columns:
            df1 = df1.withColumn(col_name, lit(None))
    for col_name in df1.columns:
        if col_name not in df2.columns:
            df2 = df2.withColumn(col_name, lit(None))
    return df1, df2

Method 1: Merging with union()

The union() method merges DataFrames by column position, not by column name. This means the first column of DataFrame 1 is matched with the first column of DataFrame 2, regardless of their names.

merged_df = dataframe1.union(dataframe2)
merged_df.show()

Output:

+---+-------+---------+----+
| ID| Name| Address| Age|
+---+-------+---------+----+
| 1| Sravan| Kakumanu|null|
| 2| Ojaswi|Hyderabad|null|
| 3| Rohith| Delhi|null|
| 4|Sridevi| Kakumanu|null|
| 5| Bobby| Guntur|null|
| 1| 23| null|null|
| 2| 21| null|null|
| 3| 32| null|null|
+---+-------+---------+----+

Caution:

Notice the problem: because union() matches by position, the Age values (23, 21, 32) from DataFrame 2 ended up in the Name column instead of the Age column. This happens because DataFrame 2's column order is [ID, Age, Name, Address], while DataFrame 1's order is [ID, Name, Address, Age].

Always ensure both DataFrames have columns in the same order when using union(), or use unionByName() instead.

Fixing Column Order for union()

To use union() correctly, reorder DataFrame 2's columns to match DataFrame 1:

# Reorder dataframe2 columns to match dataframe1
dataframe2_reordered = dataframe2.select(dataframe1.columns)

merged_df = dataframe1.union(dataframe2_reordered)
merged_df.show()

Output:

+---+-------+---------+----+
| ID| Name| Address| Age|
+---+-------+---------+----+
| 1| Sravan| Kakumanu|null|
| 2| Ojaswi|Hyderabad|null|
| 3| Rohith| Delhi|null|
| 4|Sridevi| Kakumanu|null|
| 5| Bobby| Guntur|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+---------+----+

Now the values are correctly placed in their respective columns.

Method 2: Merging with unionByName()

The unionByName() method merges DataFrames by column name, so column order does not matter. This is the recommended approach when columns might be in different positions across DataFrames.

merged_df = dataframe1.unionByName(dataframe2)
merged_df.show()

Output:

+---+-------+---------+----+
| ID| Name| Address| Age|
+---+-------+---------+----+
| 1| Sravan| Kakumanu|null|
| 2| Ojaswi|Hyderabad|null|
| 3| Rohith| Delhi|null|
| 4|Sridevi| Kakumanu|null|
| 5| Bobby| Guntur|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+---------+----+

The values are correctly matched to their columns regardless of column order.

Method 3: Using unionByName() with allowMissingColumns (Spark 3.1+)

Starting with Spark 3.1, unionByName() supports an allowMissingColumns parameter that automatically fills missing columns with null - eliminating the need to manually add missing columns first.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_demo").getOrCreate()

# Create DataFrames with different schemas (no manual column alignment needed)
data1 = [
    ["1", "Sravan", "Kakumanu"],
    ["2", "Ojaswi", "Hyderabad"],
    ["3", "Rohith", "Delhi"],
]
dataframe1 = spark.createDataFrame(data1, ["ID", "Name", "Address"])

data2 = [
    ["4", 23],
    ["5", 21],
]
dataframe2 = spark.createDataFrame(data2, ["ID", "Age"])

# Merge directly: no need to add missing columns manually
merged_df = dataframe1.unionByName(dataframe2, allowMissingColumns=True)
merged_df.show()

Output:

+---+------+---------+----+
| ID| Name| Address| Age|
+---+------+---------+----+
| 1|Sravan| Kakumanu|null|
| 2|Ojaswi|Hyderabad|null|
| 3|Rohith| Delhi|null|
| 4| null| null| 23|
| 5| null| null| 21|
+---+------+---------+----+

Info:

This is the cleanest and most concise approach. If you are on Spark 3.1 or later, prefer unionByName(allowMissingColumns=True) over manually adding missing columns. It handles everything automatically and is less error-prone.

Method 4: Using unionAll()

unionAll() is functionally identical to union() in PySpark - both merge by position and neither removes duplicates. The names come from SQL, where UNION deduplicates and UNION ALL does not; the DataFrame API's union() has never removed duplicates, and unionAll() was deprecated in Spark 2.0 in favor of union().

# Ensure columns are aligned before using unionAll()
dataframe2_reordered = dataframe2.select(dataframe1.columns)
merged_df = dataframe1.unionAll(dataframe2_reordered)
merged_df.show()

Note:

Since unionAll() behaves the same as union() in PySpark, prefer using union() or unionByName() for clarity. unionAll() is maintained for backward compatibility.

Comparison of Merge Methods

| Method | Matches By | Handles Different Column Order | Auto-fills Missing Columns | Spark Version |
|---|---|---|---|---|
| union() | Position | ❌ No | ❌ No | All |
| unionAll() | Position | ❌ No | ❌ No | All |
| unionByName() | Column name | ✅ Yes | ❌ No | 2.3+ |
| unionByName(allowMissingColumns=True) | Column name | ✅ Yes | ✅ Yes | 3.1+ |

Merging More Than Two DataFrames

When merging multiple DataFrames, you can use functools.reduce for a clean approach:

from functools import reduce

# dataframe3 stands in for any additional DataFrame with its own schema
dataframes = [dataframe1, dataframe2, dataframe3]

merged_df = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes,
)

merged_df.show()

Summary

Merging PySpark DataFrames with different schemas requires handling missing columns. Here are the key takeaways:

  • Spark 3.1+: Use unionByName(allowMissingColumns=True) - the simplest and safest approach that handles everything automatically.
  • Spark < 3.1: Manually add missing columns with lit(None) to both DataFrames before merging.
  • Use unionByName() instead of union() to avoid column-ordering bugs.
  • Avoid union() unless you are certain both DataFrames have columns in the exact same order.
  • Use reduce() with unionByName() when merging more than two DataFrames.