How to Merge Two PySpark DataFrames With Different Columns
When working with multiple PySpark DataFrames, you frequently need to combine them vertically (stacking rows). PySpark's union() and unionByName() operations require both DataFrames to have the same set of columns. If the DataFrames have different columns, the union either fails outright or silently misaligns data.
In this guide, you will learn how to handle this scenario by adding missing columns to each DataFrame before performing the union, using union(), unionAll(), and unionByName(), plus the allowMissingColumns shortcut available in PySpark 3.1+.
The Problem: Different Column Sets
Consider two DataFrames with different columns:
DataFrame 1 - has columns ID, NAME, Address:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('merge_demo').getOrCreate()
data1 = [
["1", "James", "New York"],
["2", "Emily", "Los Angeles"],
["3", "Michael", "Chicago"],
["4", "Sarah", "Houston"],
["5", "David", "Phoenix"]
]
dataframe1 = spark.createDataFrame(data1, ['ID', 'NAME', 'Address'])
dataframe1.show()
Output:
+---+-------+-----------+
| ID| NAME| Address|
+---+-------+-----------+
| 1| James| New York|
| 2| Emily|Los Angeles|
| 3|Michael| Chicago|
| 4| Sarah| Houston|
| 5| David| Phoenix|
+---+-------+-----------+
DataFrame 2 - has columns ID, Age:
data2 = [
["1", 23],
["2", 21],
["3", 32]
]
dataframe2 = spark.createDataFrame(data2, ['ID', 'Age'])
dataframe2.show()
Output:
+---+---+
| ID|Age|
+---+---+
| 1| 23|
| 2| 21|
| 3| 32|
+---+---+
Attempting to union these directly fails:
# ❌ This raises an error because the column counts differ
dataframe1.union(dataframe2).show()
AnalysisException: Union can only be performed on tables with
the same number of columns, but the first table has 3 columns
and the second table has 2 columns
The Solution: Add Missing Columns With lit(None)
Before performing the union, add the missing columns to each DataFrame using lit(None) to fill them with null values:
from pyspark.sql.functions import lit
# Add columns from dataframe2 that are missing in dataframe1
for column in dataframe2.columns:
    if column not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(column, lit(None))

# Add columns from dataframe1 that are missing in dataframe2
for column in dataframe1.columns:
    if column not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(column, lit(None))
print("DataFrame 1 columns:", dataframe1.columns)
print("DataFrame 2 columns:", dataframe2.columns)
Output:
DataFrame 1 columns: ['ID', 'NAME', 'Address', 'Age']
DataFrame 2 columns: ['ID', 'Age', 'NAME', 'Address']
Now both DataFrames have the same four columns. The missing values are filled with null.
Since this pattern is used frequently, wrap it in a helper function:
from pyspark.sql.functions import lit
def align_columns(df1, df2):
    """Add missing columns to both DataFrames so they share the same schema."""
    for col in df2.columns:
        if col not in df1.columns:
            df1 = df1.withColumn(col, lit(None))
    for col in df1.columns:
        if col not in df2.columns:
            df2 = df2.withColumn(col, lit(None))
    return df1, df2
Method 1: Using union()
union() merges two DataFrames by column position (not by name). After aligning the columns, call union() to stack the rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName('merge_demo').getOrCreate()
# Create DataFrame 1
data1 = [
["1", "James", "New York"],
["2", "Emily", "Los Angeles"],
["3", "Michael", "Chicago"],
["4", "Sarah", "Houston"],
["5", "David", "Phoenix"]
]
dataframe1 = spark.createDataFrame(data1, ['ID', 'NAME', 'Address'])
# Create DataFrame 2
data2 = [["1", 23], ["2", 21], ["3", 32]]
dataframe2 = spark.createDataFrame(data2, ['ID', 'Age'])
# Align columns
for col in dataframe2.columns:
    if col not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(col, lit(None))

for col in dataframe1.columns:
    if col not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(col, lit(None))

# union() matches by position, so reorder dataframe2 to match dataframe1 first
dataframe2 = dataframe2.select(dataframe1.columns)

# Perform union
dataframe1.union(dataframe2).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
Warning: union() matches columns by position, not by name. Because union() pairs columns by ordinal position, the column order in both DataFrames must be identical. After the alignment loops alone, dataframe1 has columns ['ID', 'NAME', 'Address', 'Age'] while dataframe2 has ['ID', 'Age', 'NAME', 'Address']; a purely positional union would pair Age with NAME and silently misalign the data.
Fix: Either reorder columns before the union or use unionByName() instead:
# Reorder dataframe2 columns to match dataframe1
dataframe2 = dataframe2.select(dataframe1.columns)
dataframe1.union(dataframe2).show()
Method 2: Using unionAll()
unionAll() is simply an alias for union(). It was the original name in Spark 1.x and was deprecated in Spark 2.0 in favor of union(); both behave identically, returning all rows including duplicates (unlike SQL's UNION, which deduplicates). Prefer union() in new code:
# Reorder columns to match, then union
dataframe2_aligned = dataframe2.select(dataframe1.columns)
dataframe1.unionAll(dataframe2_aligned).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
Method 3: Using unionByName() - The Safest Approach
unionByName() matches columns by name rather than position, making it the safest method when column orders differ between DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName('merge_demo').getOrCreate()
# Create DataFrames
dataframe1 = spark.createDataFrame(
[["1", "James", "New York"], ["2", "Emily", "Los Angeles"],
["3", "Michael", "Chicago"], ["4", "Sarah", "Houston"],
["5", "David", "Phoenix"]],
['ID', 'NAME', 'Address']
)
dataframe2 = spark.createDataFrame(
[["1", 23], ["2", 21], ["3", 32]],
['ID', 'Age']
)
# Align columns
for col in dataframe2.columns:
    if col not in dataframe1.columns:
        dataframe1 = dataframe1.withColumn(col, lit(None))

for col in dataframe1.columns:
    if col not in dataframe2.columns:
        dataframe2 = dataframe2.withColumn(col, lit(None))
# Perform unionByName: no need to worry about column order
dataframe1.unionByName(dataframe2).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
PySpark 3.1+: Built-In allowMissingColumns
Starting with PySpark 3.1, unionByName() supports an allowMissingColumns parameter that automatically handles missing columns - no manual alignment needed:
# PySpark 3.1+ only: automatically fills missing columns with null
dataframe1.unionByName(dataframe2, allowMissingColumns=True).show()
Output:
+---+-------+-----------+----+
| ID| NAME| Address| Age|
+---+-------+-----------+----+
| 1| James| New York|null|
| 2| Emily|Los Angeles|null|
| 3|Michael| Chicago|null|
| 4| Sarah| Houston|null|
| 5| David| Phoenix|null|
| 1| null| null| 23|
| 2| null| null| 21|
| 3| null| null| 32|
+---+-------+-----------+----+
If you are using PySpark 3.1 or later, unionByName(allowMissingColumns=True) is the simplest and most reliable approach - it eliminates the need for manual column alignment entirely.
Comparison of Methods
| Method | Column Matching | Handles Missing Columns | Column Order Matters? |
|---|---|---|---|
| union() | By position | ❌ Must align manually | ✅ Yes - order must match |
| unionAll() | By position | ❌ Must align manually | ✅ Yes - order must match |
| unionByName() | By name | ❌ Must align manually (pre-3.1) | ❌ No |
| unionByName(allowMissingColumns=True) | By name | ✅ Automatic (PySpark 3.1+) | ❌ No |
Complete Reusable Solution
Here is a complete, production-ready function that works across PySpark versions:
from pyspark.sql.functions import lit

def merge_dataframes(df1, df2):
    """
    Merge two PySpark DataFrames with potentially different columns.
    Missing columns are filled with null values.
    Uses unionByName for safe column matching.
    """
    # Add missing columns to df1
    for col in df2.columns:
        if col not in df1.columns:
            df1 = df1.withColumn(col, lit(None))
    # Add missing columns to df2
    for col in df1.columns:
        if col not in df2.columns:
            df2 = df2.withColumn(col, lit(None))
    return df1.unionByName(df2)
# Usage
result = merge_dataframes(dataframe1, dataframe2)
result.show()
Conclusion
Merging PySpark DataFrames with different columns requires adding the missing columns (filled with null) to each DataFrame before performing the union.
The safest approach is unionByName(), which matches columns by name regardless of order.
If you are using PySpark 3.1+, the built-in allowMissingColumns=True parameter handles everything automatically, making it the simplest and most reliable solution.
For older PySpark versions, manually aligning columns with lit(None) and using unionByName() is the recommended pattern.