Skip to main content

Python Pandas: How to Compare Two DataFrames with Pandas compare()

When working with data in pandas, you'll often need to identify differences between two DataFrames, whether you're tracking changes in datasets over time, validating data transformations, auditing updates, or debugging data pipelines. The DataFrame.compare() method, introduced in pandas 1.1.0, provides a clean and structured way to highlight exactly what changed between two DataFrames.

In this guide, you'll learn how to use compare() effectively, understand each of its parameters, and see practical examples that cover common comparison scenarios.

Prerequisites

The compare() method requires pandas 1.1.0 or later. Check your version and upgrade if needed:

import pandas as pd
print(pd.__version__)

If your version is below 1.1.0, upgrade with:

pip install pandas --upgrade

Understanding DataFrame.compare()

The compare() method compares two DataFrames element by element and returns a new DataFrame highlighting only the differences.

Syntax:

DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)
ParameterTypeDefaultDescription
otherDataFrame-The DataFrame to compare against
align_axis0 or 111: differences shown as columns; 0: differences shown as rows
keep_shapeboolFalseIf True, keep all rows and columns (not just those with differences)
keep_equalboolFalseIf True, show equal values instead of replacing them with NaN

Setting Up the DataFrames

Let's create two DataFrames that share the same structure but have some different values:

import pandas as pd

# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})

# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6

print("Original DataFrame:")
print(df_original)
print("\nUpdated DataFrame:")
print(df_updated)

Output:

Original DataFrame:
Product Price Quantity
0 Pens 100 10
1 Scales 50 5
2 Pencils 25 5
3 Geometry Box 100 2
4 Crayon Set 65 1

Updated DataFrame:
Product Price Quantity
0 Pens 150 15
1 Scales 70 7
2 Pencils 30 6
3 Geometry Box 100 2
4 Crayon Set 65 1

Basic Comparison (Default Settings)

With default settings, compare() shows only the rows and columns that have differences, with self and other labels indicating the old and new values:

import pandas as pd

# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})

# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6

diff = df_original.compare(df_updated)
print(diff)

Output:

   Price        Quantity      
self other self other
0 100.0 150.0 10.0 15.0
1 50.0 70.0 5.0 7.0
2 25.0 30.0 5.0 6.0

Only rows 0, 1, and 2 appear because rows 3 and 4 are identical. Only Price and Quantity columns appear because Product values didn't change. The self column shows the original value and other shows the updated value.

Aligning Differences on Rows (align_axis=0)

Setting align_axis=0 displays differences vertically with self and other as row labels:

import pandas as pd

# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})

# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6

diff = df_original.compare(df_updated, align_axis=0)
print(diff)

Output:

         Price  Quantity
0 self 100.0 10.0
other 150.0 15.0
1 self 50.0 5.0
other 70.0 7.0
2 self 25.0 5.0
other 30.0 6.0

This format can be easier to read when comparing many columns, as each row pair shows the before and after values side by side.

Keeping the Full Shape (keep_shape=True)

By default, compare() only shows rows and columns with differences. Setting keep_shape=True retains the entire DataFrame structure, filling unchanged values with NaN:

import pandas as pd

# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})

# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6

diff = df_original.compare(df_updated, keep_shape=True)
print(diff)

Output:

  Product        Price        Quantity      
self other self other self other
0 NaN NaN 100.0 150.0 10.0 15.0
1 NaN NaN 50.0 70.0 5.0 7.0
2 NaN NaN 25.0 30.0 5.0 6.0
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN

All rows and columns are present. Unchanged values (like Product and rows 3–4) appear as NaN.

Showing Equal Values (keep_equal=True)

Setting keep_equal=True displays the actual values for unchanged cells instead of NaN:

diff = df_original.compare(df_updated, keep_shape=True, keep_equal=True)
print(diff)

Output:

        Product               Price        Quantity
self other self other self other
0 Pens Pens 100 150 10 15
1 Scales Scales 50 70 5 7
2 Pencils Pencils 25 30 5 6
3 Geometry Box Geometry Box 100 100 2 2
4 Crayon Set Crayon Set 65 65 1 1

This gives you a complete side-by-side view of both DataFrames, making it easy to see what changed and what stayed the same.

Comparing Only Differences Without Shape

Using just keep_equal=True (without keep_shape) shows equal values for changed rows but still hides entirely unchanged rows:

import pandas as pd

# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})

# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6

diff = df_original.compare(df_updated, keep_equal=True)
print(diff)

Output:

  Price       Quantity      
self other self other
0 100 150 10 15
1 50 70 5 7
2 25 30 5 6
Parameter Combination Summary
keep_shapekeep_equalResult
FalseFalseOnly changed rows/columns, differences only (default)
FalseTrueOnly changed rows/columns, all values shown
TrueFalseAll rows/columns, unchanged values as NaN
TrueTrueAll rows/columns, all values shown

Practical Example: Tracking Price Changes

Here's a real-world example that uses compare() to generate a change report:

import pandas as pd

# Weekly price data
week1 = pd.DataFrame({
'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
'Price': [999, 699, 329, 249, 79],
'Stock': [50, 120, 85, 60, 200]
})

week2 = pd.DataFrame({
'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
'Price': [949, 699, 349, 249, 69],
'Stock': [42, 95, 85, 55, 180]
})

# Find differences
changes = week1.compare(week2)

print("Changes between Week 1 and Week 2:")
print(changes)

# Count number of changes per column
print(f"\nTotal changes per column:")
for col in changes.columns.get_level_values(0).unique():
count = changes[col].dropna().shape[0]
print(f" {col}: {count} changes")

Output:

Changes between Week 1 and Week 2:
Price Stock
self other self other
0 999.0 949.0 50.0 42.0
1 NaN NaN 120.0 95.0
2 329.0 349.0 NaN NaN
3 NaN NaN 60.0 55.0
4 79.0 69.0 200.0 180.0

Total changes per column:
Price: 3 changes
Stock: 4 changes

Common Errors and How to Avoid Them

Error: Mismatched DataFrame Shapes

The compare() method requires both DataFrames to have identical indices and columns. If they don't match, you'll get a ValueError:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4]})

df1.compare(df2)

Output:

ValueError: Can only compare identically-labeled DataFrame objects

Solution: Ensure both DataFrames have the same shape and labels before comparing. If needed, align them first:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4]})

# Reset index and trim to matching rows
common_index = df1.index.intersection(df2.index)
df1_aligned = df1.loc[common_index]
df2_aligned = df2.loc[common_index]

diff = df1_aligned.compare(df2_aligned)
Error: Different Column Names

The column names must also match exactly:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice'], 'Age': [30]})
df2 = pd.DataFrame({'Name': ['Alice'], 'age': [31]}) # lowercase 'age'

df1.compare(df2) # ValueError

Solution: Standardize column names before comparing:

df2.columns = df1.columns  # Or use df.rename()

Alternative: Using != for Quick Difference Detection

For a simpler boolean view of where differences exist, use the != operator:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 9, 3], 'B': [4, 5, 8]})

print("Where values differ:")
print(df1 != df2)

Output:

Where values differ:
A B
0 False False
1 True False
2 False True

This tells you where differences exist but not what the differences are. That's where compare() excels.

Conclusion

The DataFrame.compare() method is a powerful tool for identifying and visualizing differences between two DataFrames:

  • Use the default settings for a concise view of only the changed values.
  • Set align_axis=0 to display differences vertically for better readability with many columns.
  • Set keep_shape=True to retain the full DataFrame structure and see which rows were unchanged.
  • Set keep_equal=True to display actual values instead of NaN for unchanged cells.
  • Always ensure both DataFrames have identical indices and columns before comparing to avoid ValueError.

For quick boolean detection of where differences exist, the != operator is a simpler alternative. For detailed change tracking with old and new values, compare() is the ideal choice.