Python Pandas: How to Compare Two DataFrames with Pandas compare()
When working with data in pandas, you'll often need to identify differences between two DataFrames, whether you're tracking changes in datasets over time, validating data transformations, auditing updates, or debugging data pipelines. The DataFrame.compare() method, introduced in pandas 1.1.0, provides a clean and structured way to highlight exactly what changed between two DataFrames.
In this guide, you'll learn how to use compare() effectively, understand each of its parameters, and see practical examples that cover common comparison scenarios.
Prerequisites
The compare() method requires pandas 1.1.0 or later. Check your version and upgrade if needed:
import pandas as pd
print(pd.__version__)
If your version is below 1.1.0, upgrade with:
pip install pandas --upgrade
Understanding DataFrame.compare()
The compare() method compares two DataFrames element by element and returns a new DataFrame highlighting only the differences.
Syntax:
DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
other | DataFrame | - | The DataFrame to compare against |
align_axis | 0 or 1 | 1 | 1: differences shown as columns; 0: differences shown as rows |
keep_shape | bool | False | If True, keep all rows and columns (not just those with differences) |
keep_equal | bool | False | If True, show equal values instead of replacing them with NaN |
Setting Up the DataFrames
Let's create two DataFrames that share the same structure but have some different values:
import pandas as pd
# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})
# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6
print("Original DataFrame:")
print(df_original)
print("\nUpdated DataFrame:")
print(df_updated)
Output:
Original DataFrame:
Product Price Quantity
0 Pens 100 10
1 Scales 50 5
2 Pencils 25 5
3 Geometry Box 100 2
4 Crayon Set 65 1
Updated DataFrame:
Product Price Quantity
0 Pens 150 15
1 Scales 70 7
2 Pencils 30 6
3 Geometry Box 100 2
4 Crayon Set 65 1
Basic Comparison (Default Settings)
With default settings, compare() shows only the rows and columns that have differences, with self and other labels indicating the old and new values:
import pandas as pd
# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})
# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6
diff = df_original.compare(df_updated)
print(diff)
Output:
Price Quantity
self other self other
0 100.0 150.0 10.0 15.0
1 50.0 70.0 5.0 7.0
2 25.0 30.0 5.0 6.0
Only rows 0, 1, and 2 appear because rows 3 and 4 are identical. Only Price and Quantity columns appear because Product values didn't change. The self column shows the original value and other shows the updated value.
Aligning Differences on Rows (align_axis=0)
Setting align_axis=0 displays differences vertically with self and other as row labels:
import pandas as pd
# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})
# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6
diff = df_original.compare(df_updated, align_axis=0)
print(diff)
Output:
Price Quantity
0 self 100.0 10.0
other 150.0 15.0
1 self 50.0 5.0
other 70.0 7.0
2 self 25.0 5.0
other 30.0 6.0
This format can be easier to read when comparing many columns, as each row pair shows the before and after values side by side.
Keeping the Full Shape (keep_shape=True)
By default, compare() only shows rows and columns with differences. Setting keep_shape=True retains the entire DataFrame structure, filling unchanged values with NaN:
import pandas as pd
# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})
# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6
diff = df_original.compare(df_updated, keep_shape=True)
print(diff)
Output:
Product Price Quantity
self other self other self other
0 NaN NaN 100.0 150.0 10.0 15.0
1 NaN NaN 50.0 70.0 5.0 7.0
2 NaN NaN 25.0 30.0 5.0 6.0
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
All rows and columns are present. Unchanged values (like Product and rows 3–4) appear as NaN.
Showing Equal Values (keep_equal=True)
Setting keep_equal=True displays the actual values for unchanged cells instead of NaN:
diff = df_original.compare(df_updated, keep_shape=True, keep_equal=True)
print(diff)
Output:
Product Price Quantity
self other self other self other
0 Pens Pens 100 150 10 15
1 Scales Scales 50 70 5 7
2 Pencils Pencils 25 30 5 6
3 Geometry Box Geometry Box 100 100 2 2
4 Crayon Set Crayon Set 65 65 1 1
This gives you a complete side-by-side view of both DataFrames, making it easy to see what changed and what stayed the same.
Comparing Only Differences Without Shape
Using just keep_equal=True (without keep_shape) shows equal values for changed rows but still hides entirely unchanged rows:
import pandas as pd
# Original DataFrame
df_original = pd.DataFrame({
'Product': ['Pens', 'Scales', 'Pencils', 'Geometry Box', 'Crayon Set'],
'Price': [100, 50, 25, 100, 65],
'Quantity': [10, 5, 5, 2, 1]
})
# Updated DataFrame with some changes
df_updated = df_original.copy()
df_updated.loc[0, 'Price'] = 150 # Changed: 100 → 150
df_updated.loc[1, 'Price'] = 70 # Changed: 50 → 70
df_updated.loc[2, 'Price'] = 30 # Changed: 25 → 30
df_updated.loc[0, 'Quantity'] = 15 # Changed: 10 → 15
df_updated.loc[1, 'Quantity'] = 7 # Changed: 5 → 7
df_updated.loc[2, 'Quantity'] = 6 # Changed: 5 → 6
diff = df_original.compare(df_updated, keep_equal=True)
print(diff)
Output:
Price Quantity
self other self other
0 100 150 10 15
1 50 70 5 7
2 25 30 5 6
keep_shape | keep_equal | Result |
|---|---|---|
False | False | Only changed rows/columns, differences only (default) |
False | True | Only changed rows/columns, all values shown |
True | False | All rows/columns, unchanged values as NaN |
True | True | All rows/columns, all values shown |
Practical Example: Tracking Price Changes
Here's a real-world example that uses compare() to generate a change report:
import pandas as pd
# Weekly price data
week1 = pd.DataFrame({
'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
'Price': [999, 699, 329, 249, 79],
'Stock': [50, 120, 85, 60, 200]
})
week2 = pd.DataFrame({
'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
'Price': [949, 699, 349, 249, 69],
'Stock': [42, 95, 85, 55, 180]
})
# Find differences
changes = week1.compare(week2)
print("Changes between Week 1 and Week 2:")
print(changes)
# Count number of changes per column
print(f"\nTotal changes per column:")
for col in changes.columns.get_level_values(0).unique():
count = changes[col].dropna().shape[0]
print(f" {col}: {count} changes")
Output:
Changes between Week 1 and Week 2:
Price Stock
self other self other
0 999.0 949.0 50.0 42.0
1 NaN NaN 120.0 95.0
2 329.0 349.0 NaN NaN
3 NaN NaN 60.0 55.0
4 79.0 69.0 200.0 180.0
Total changes per column:
Price: 3 changes
Stock: 4 changes
Common Errors and How to Avoid Them
The compare() method requires both DataFrames to have identical indices and columns. If they don't match, you'll get a ValueError:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4]})
df1.compare(df2)
Output:
ValueError: Can only compare identically-labeled DataFrame objects
Solution: Ensure both DataFrames have the same shape and labels before comparing. If needed, align them first:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4]})
# Reset index and trim to matching rows
common_index = df1.index.intersection(df2.index)
df1_aligned = df1.loc[common_index]
df2_aligned = df2.loc[common_index]
diff = df1_aligned.compare(df2_aligned)
The column names must also match exactly:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alice'], 'Age': [30]})
df2 = pd.DataFrame({'Name': ['Alice'], 'age': [31]}) # lowercase 'age'
df1.compare(df2) # ValueError
Solution: Standardize column names before comparing:
df2.columns = df1.columns # Or use df.rename()
Alternative: Using != for Quick Difference Detection
For a simpler boolean view of where differences exist, use the != operator:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 9, 3], 'B': [4, 5, 8]})
print("Where values differ:")
print(df1 != df2)
Output:
Where values differ:
A B
0 False False
1 True False
2 False True
This tells you where differences exist but not what the differences are. That's where compare() excels.
Conclusion
The DataFrame.compare() method is a powerful tool for identifying and visualizing differences between two DataFrames:
- Use the default settings for a concise view of only the changed values.
- Set
align_axis=0to display differences vertically for better readability with many columns. - Set
keep_shape=Trueto retain the full DataFrame structure and see which rows were unchanged. - Set
keep_equal=Trueto display actual values instead ofNaNfor unchanged cells. - Always ensure both DataFrames have identical indices and columns before comparing to avoid
ValueError.
For quick boolean detection of where differences exist, the != operator is a simpler alternative. For detailed change tracking with old and new values, compare() is the ideal choice.