How to Compare Two CSV Files in Python
Comparing CSV files is a common task in data validation, ETL pipeline testing, database migration verification, and tracking changes between different versions of a dataset. Depending on what you need to find, the comparison might involve detecting changed cell values, identifying added or removed rows, or verifying overall data integrity.
This guide covers practical comparison techniques using Pandas for analytical workflows and the built-in csv module for memory-efficient handling of large files.
Comparing Cell Values with df.compare()
When two CSV files share the same structure (identical rows and columns), the df.compare() method generates a clear diff report showing exactly which cells changed between the two versions.
Consider two CSV files, orders_old.csv and orders_new.csv:

orders_old.csv:

order_id,amounts,status
1001,150,shipped
1002,50,pending
1003,100,NaN

orders_new.csv:

order_id,amounts,status
1001,175,NaN
1002,50,pending
1003,NaN,NaN
import pandas as pd
# Load CSVs with a common key as the index
df1 = pd.read_csv("orders_old.csv", index_col="order_id")
df2 = pd.read_csv("orders_new.csv", index_col="order_id")
# Compare the two DataFrames
diff = df1.compare(df2)
print(diff)
Example output:
         amounts           status
            self  other      self other
order_id
1001       150.0  175.0   shipped   NaN
1003       100.0    NaN       NaN   NaN
The self column shows the value from the first DataFrame, and other shows the value from the second. Only cells that differ appear in the output. Matching cells are excluded entirely.
df.compare() requires both DataFrames to have the same shape and identical row and column labels. If your files have different numbers of rows, you will need the merge-based approach covered in the next section.
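When you cannot guarantee up front that the two files line up, a small guard avoids an unhandled ValueError. A minimal sketch, using inline DataFrames that mirror the two order files above:

```python
import pandas as pd

# Inline stand-ins for orders_old.csv and orders_new.csv
df1 = pd.DataFrame(
    {"amounts": [150, 50, 100], "status": ["shipped", "pending", None]},
    index=pd.Index([1001, 1002, 1003], name="order_id"),
)
df2 = pd.DataFrame(
    {"amounts": [175, 50, None], "status": [None, "pending", None]},
    index=pd.Index([1001, 1002, 1003], name="order_id"),
)

# compare() needs the same shape and the same row/column labels
if (
    df1.shape == df2.shape
    and df1.index.equals(df2.index)
    and df1.columns.equals(df2.columns)
):
    diff = df1.compare(df2)
    print(diff)
else:
    print("Structures differ; fall back to a merge-based comparison.")
```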
Keeping Equal Values for Context
By default, df.compare() only shows cells that differ, which can make it hard to understand the surrounding context. You can include all values by passing keep_equal=True and keep_shape=True:
diff = df1.compare(df2, keep_equal=True, keep_shape=True)
print(diff)
This produces a full-width output where every cell is shown, making it easier to see the changed values alongside the unchanged ones.
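If all you need is a full-shape view of which cells changed, a boolean mask is an alternative sketch; the both-NaN guard keeps cells that are missing in both frames from being flagged, since NaN never compares equal to NaN:

```python
import pandas as pd

# Hypothetical frames standing in for the two file versions
df1 = pd.DataFrame({"amounts": [150, 50, 100]}, index=[1001, 1002, 1003])
df2 = pd.DataFrame({"amounts": [175, 50, None]}, index=[1001, 1002, 1003])

# True where a cell genuinely changed; both-NaN pairs count as equal
changed = df1.ne(df2) & ~(df1.isna() & df2.isna())
print(changed)
```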
Finding Added or Removed Rows
When rows have been inserted or deleted between versions, cell-level comparison is not enough. Instead, use pd.merge() with the indicator parameter, which adds a column telling you where each row came from:
import pandas as pd
df1 = pd.read_csv("orders_old.csv")
df2 = pd.read_csv("orders_new.csv")
# Outer merge with an indicator column
merged = pd.merge(
    df1, df2,
    how="outer",
    indicator=True,
    suffixes=("_old", "_new"),
)
# Categorize the rows
added = merged[merged["_merge"] == "right_only"]
deleted = merged[merged["_merge"] == "left_only"]
common = merged[merged["_merge"] == "both"]
print(f"Added rows: {len(added)}")
print(f"Deleted rows: {len(deleted)}")
print(f"Common rows: {len(common)}")
Output:
Added rows: 2
Deleted rows: 2
Common rows: 1
The _merge column contains one of three values:
- "left_only" means the row exists only in the first file (deleted).
- "right_only" means the row exists only in the second file (added).
- "both" means the row exists in both files.
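A quick way to summarize the split is value_counts() on the indicator column. A sketch with two small hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({"order_id": [1001, 1002, 1003],
                    "status": ["shipped", "pending", "pending"]})
df2 = pd.DataFrame({"order_id": [1002, 1003, 1004],
                    "status": ["pending", "pending", "shipped"]})

merged = pd.merge(df1, df2, how="outer", indicator=True)
# One count per category: both, left_only, right_only
print(merged["_merge"].value_counts())
```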
Merging on Specific Key Columns
If your files do not have identical columns or you want to match rows based on specific identifiers, pass the on parameter:
merged = pd.merge(
    df1, df2,
    on=["customer_id", "email"],
    how="outer",
    indicator=True,
)
This matches rows only when both customer_id and email are the same across the two files.
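Key-based merges silently multiply rows when a key appears more than once in either file. The merge validate parameter turns that into an explicit error instead; a sketch with a deliberately duplicated, hypothetical customer_id:

```python
import pandas as pd

df1 = pd.DataFrame({"customer_id": [1, 2, 2], "plan": ["basic", "pro", "pro"]})
df2 = pd.DataFrame({"customer_id": [1, 2], "plan": ["basic", "team"]})

duplicates_found = False
try:
    merged = pd.merge(
        df1, df2,
        on="customer_id",
        how="outer",
        indicator=True,
        suffixes=("_old", "_new"),
        validate="one_to_one",  # raise if keys are not unique on both sides
    )
except pd.errors.MergeError:
    duplicates_found = True
    print("Duplicate keys detected; deduplicate before comparing.")
```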
Building a Complete Diff Report
For a comprehensive comparison that captures added rows, deleted rows, and modified cell values in a single report, you can combine both techniques into a reusable function:
import pandas as pd

def compare_csv_files(file1, file2, key_column):
    """Generate a complete comparison report between two CSV files."""
    df1 = pd.read_csv(file1)
    df2 = pd.read_csv(file2)
    # Find added and deleted rows via outer merge
    merged = pd.merge(
        df1, df2,
        on=key_column,
        how="outer",
        indicator=True,
        suffixes=("_old", "_new"),
    )
    report = {
        "added": merged[merged["_merge"] == "right_only"][key_column].tolist(),
        "deleted": merged[merged["_merge"] == "left_only"][key_column].tolist(),
        "modified": [],
    }
    # Check common rows for cell-level modifications
    common = merged[merged["_merge"] == "both"]
    for col in df1.columns:
        if col == key_column:
            continue
        old_col = f"{col}_old"
        new_col = f"{col}_new"
        if old_col in common.columns and new_col in common.columns:
            # Treat a pair of NaNs as equal; a plain != would flag them,
            # because NaN never compares equal to NaN
            changed = common[
                (common[old_col] != common[new_col])
                & ~(common[old_col].isna() & common[new_col].isna())
            ]
            for _, row in changed.iterrows():
                report["modified"].append({
                    "key": row[key_column],
                    "column": col,
                    "old_value": row[old_col],
                    "new_value": row[new_col],
                })
    return report

# Usage
report = compare_csv_files("orders_old.csv", "orders_new.csv", "order_id")
print(f"Added: {report['added']}")
print(f"Deleted: {report['deleted']}")
print(f"Modified cells: {len(report['modified'])}")
for change in report["modified"]:
    print(f"  Row {change['key']}, column '{change['column']}': "
          f"{change['old_value']} -> {change['new_value']}")

Output:

Added: []
Deleted: []
Modified cells: 3
  Row 1001, column 'amounts': 150 -> 175.0
  Row 1003, column 'amounts': 100 -> nan
  Row 1001, column 'status': shipped -> nan

Note the both-NaN guard in the filter: without it, row 1003's status (missing in both files) would be reported as a modification, because NaN != NaN evaluates to True in pandas.
Ignoring Specific Columns
Metadata columns like timestamps or export sequence numbers often differ between files but are not meaningful for comparison. Drop them before comparing:
import pandas as pd
df1 = pd.read_csv("orders_old.csv", index_col="order_id")
df2 = pd.read_csv("orders_new.csv", index_col="order_id")
# Drop irrelevant columns before comparison
ignore_cols = ["updated_at", "export_timestamp"]
df1_clean = df1.drop(columns=ignore_cols, errors="ignore")
df2_clean = df2.drop(columns=ignore_cols, errors="ignore")
diff = df1_clean.compare(df2_clean)
print(diff)
Output:
         amounts
            self  other
order_id
1001       150.0  175.0
1003       100.0    NaN
The errors="ignore" parameter ensures the code does not break if a column in the ignore list does not exist in one of the files.
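When the two files have drifted apart in schema and you simply want to compare whatever they still share, you can intersect the column sets instead of maintaining an ignore list. A sketch with hypothetical metadata columns:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"amounts": [150, 100], "updated_at": ["2024-01-01", "2024-01-02"]},
    index=pd.Index([1001, 1003], name="order_id"),
)
df2 = pd.DataFrame(
    {"amounts": [175, 100], "export_timestamp": ["2024-02-01", "2024-02-01"]},
    index=pd.Index([1001, 1003], name="order_id"),
)

# Compare only the columns present in both frames
shared = df1.columns.intersection(df2.columns)
diff = df1[shared].compare(df2[shared])
print(diff)
```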
Handling Data Type Mismatches
CSV files often represent the same data differently. One file might store a number as "1.0" while another stores it as "1". String columns might have inconsistent whitespace. Normalizing both DataFrames before comparison prevents false positives:
import pandas as pd
def normalize_for_comparison(df):
    """Normalize data types and formatting for fair comparison."""
    df = df.copy()
    for col in df.columns:
        # Try converting string columns to numeric
        if df[col].dtype == "object":
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                # Not numeric; strip whitespace instead
                df[col] = df[col].str.strip()
    return df
df1 = normalize_for_comparison(pd.read_csv("orders_old.csv", index_col="order_id"))
df2 = normalize_for_comparison(pd.read_csv("orders_new.csv", index_col="order_id"))
diff = df1.compare(df2)
print(diff)

Output:

         amounts           status
            self  other      self other
order_id
1001       150.0  175.0   shipped   NaN
1003       100.0    NaN       NaN   NaN
Without normalization, a value of "100.0" in one file and "100" in the other would be flagged as a difference even though they represent the same number.
Be careful when normalizing data types automatically. Converting "001" to a numeric value turns it into 1, which might lose meaningful leading zeros (for example, in ZIP codes or product codes). Always consider the semantics of your data before applying blanket normalization.
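To keep identifier-like columns out of numeric conversion entirely, force them to strings at read time with read_csv's dtype parameter. A sketch using an in-memory file and a hypothetical zip_code column:

```python
import io
import pandas as pd

csv_data = "id,zip_code\n1,00501\n2,02134\n"

# Without dtype, pandas parses zip_code as an integer: leading zeros are lost
df_default = pd.read_csv(io.StringIO(csv_data))
print(df_default["zip_code"].tolist())

# Forcing str preserves the original text exactly
df_str = pd.read_csv(io.StringIO(csv_data), dtype={"zip_code": str})
print(df_str["zip_code"].tolist())
```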
Handling Large Files with the csv Module
When files are too large to fit into memory, Pandas is not practical. The built-in csv module lets you compare files without loading everything at once.
Comparing Row Keys
The simplest large-file comparison checks which keys exist in each file using sets:
import csv
def get_row_keys(filename, key_column=0):
    """Extract unique keys from a CSV without loading the entire file."""
    keys = set()
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row
        for row in reader:
            keys.add(row[key_column])
    return keys

def compare_large_csvs(file1, file2, key_column=0):
    """Compare two large CSVs by key column."""
    keys1 = get_row_keys(file1, key_column)
    keys2 = get_row_keys(file2, key_column)
    return {
        "added": keys2 - keys1,
        "deleted": keys1 - keys2,
        "common": keys1 & keys2,
    }
result = compare_large_csvs("big_file_v1.csv", "big_file_v2.csv")
print(f"New records: {len(result['added'])}")
print(f"Removed records: {len(result['deleted'])}")
print(f"Common records: {len(result['common'])}")
Output:
New records: 1523
Removed records: 87
Common records: 248390
This approach only stores the key values in memory, not the full rows, making it suitable for files with millions of records.
Chunked Processing for Value Comparison
When you need to compare actual cell values in large files (not just keys), process the data in chunks so that only a small portion of each file is in memory at any given time:
import pandas as pd
chunk_size = 10_000
diffs = []
for chunk1, chunk2 in zip(
    pd.read_csv("file1.csv", chunksize=chunk_size),
    pd.read_csv("file2.csv", chunksize=chunk_size),
):
    diff = chunk1.compare(chunk2)
    if not diff.empty:
        diffs.append(diff)

if diffs:
    all_diffs = pd.concat(diffs)
    print(f"Total differences found: {len(all_diffs)}")
    print(all_diffs)
else:
    print("Files are identical.")
Chunked processing assumes both files have the same row order and the same number of rows. If rows have been added, deleted, or reordered between versions, use the key-based approach instead and compare matching rows individually.
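One memory-efficient way to compare matching rows individually is to store a compact hash per key from the first file, then stream the second file against that map. A sketch, assuming a unique key in the first column; the in-memory StringIO files stand in for two large CSVs on disk:

```python
import csv
import hashlib
import io

def row_digest(row):
    """Stable, compact fingerprint of a row's values."""
    return hashlib.sha256("\x1f".join(row).encode()).hexdigest()

def find_modified_keys(file1_lines, file2_lines, key_column=0):
    """Return keys whose rows exist in both files but differ in content."""
    digests = {}
    reader = csv.reader(file1_lines)
    next(reader)  # skip header
    for row in reader:
        digests[row[key_column]] = row_digest(row)

    modified = []
    reader = csv.reader(file2_lines)
    next(reader)  # skip header
    for row in reader:
        key = row[key_column]
        if key in digests and digests[key] != row_digest(row):
            modified.append(key)
    return modified

# Hypothetical stand-ins for two file versions
v1 = io.StringIO("id,amount\n1,100\n2,200\n3,300\n")
v2 = io.StringIO("id,amount\n1,100\n2,250\n4,400\n")
print(find_modified_keys(v1, v2))  # ['2']
```

Only the keys and fixed-size hashes of the first file are held in memory, so this scales to files far larger than RAM while still catching value-level changes.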
Summary
| Goal | Method | Best For |
|---|---|---|
| Cell-level changes | df.compare() | Files with identical structure, finding value modifications |
| Row additions and deletions | pd.merge(indicator=True) | Tracking which records were added or removed |
| Complete diff report | Combined merge and compare | Comprehensive audit of all changes |
| Large file key comparison | csv module with sets | Memory-constrained environments |
| Large file value comparison | Chunked pd.read_csv() | Comparing values without loading full files |
For most analytical and data validation work, Pandas provides the most robust and readable comparison tools, with proper handling of data types, missing values, and edge cases. Use the csv module approach only when memory constraints prevent loading the full datasets, or when you need to compare files with millions of rows and only care about key-level differences.