
How to Compare Two CSV Files in Python

Comparing CSV files is a common task in data validation, ETL pipeline testing, database migration verification, and tracking changes between different versions of a dataset. Depending on what you need to find, the comparison might involve detecting changed cell values, identifying added or removed rows, or verifying overall data integrity.

This guide covers practical comparison techniques using Pandas for analytical workflows and the built-in csv module for memory-efficient handling of large files.

Comparing Cell Values with df.compare()

When two CSV files share the same structure (identical rows and columns), the df.compare() method generates a clear diff report showing exactly which cells changed between the two versions.

Consider these two CSV files, which describe the same three orders but differ in a few cell values:

orders_old.csv
order_id,amounts,status
1001,150,shipped
1002,50,pending
1003,100,
orders_new.csv
order_id,amounts,status
1001,175,
1002,50,pending
1003,,
import pandas as pd

# Load CSVs with a common key as the index
df1 = pd.read_csv("orders_old.csv", index_col="order_id")
df2 = pd.read_csv("orders_new.csv", index_col="order_id")

# Compare the two DataFrames
diff = df1.compare(df2)
print(diff)

Example output:

         amounts          status      
            self  other     self other
order_id                              
1001       150.0  175.0  shipped   NaN
1003       100.0    NaN      NaN   NaN

The self column shows the value from the first DataFrame, and other shows the value from the second. Rows and columns with no differences are dropped entirely, and any remaining cell that did not change is displayed as NaN.
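Because the result uses a two-level column index of (column, "self"/"other"), individual changes can also be pulled out programmatically. A minimal sketch, using small inline frames as stand-ins for the CSV files:

```python
import pandas as pd

df1 = pd.DataFrame({"amounts": [150, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))
df2 = pd.DataFrame({"amounts": [175, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))

diff = df1.compare(df2)

# Select the new value for every changed cell in "amounts"
print(diff[("amounts", "other")].to_dict())
```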

note

df.compare() requires both DataFrames to have the same shape and index. If your files have different numbers of rows, you will need the merge-based approach covered in the next section.
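If the files differ only by a few rows, one workaround (sketched here with inline frames in place of files) is to restrict both DataFrames to the index keys they share before comparing:

```python
import pandas as pd

df1 = pd.DataFrame({"amounts": [150, 50, 100]},
                   index=pd.Index([1001, 1002, 1003], name="order_id"))
df2 = pd.DataFrame({"amounts": [175, 50, 100, 20]},
                   index=pd.Index([1001, 1002, 1003, 1004], name="order_id"))

# df1.compare(df2) would raise here because the shapes differ;
# aligning both frames on the shared keys makes the comparison valid
shared = df1.index.intersection(df2.index)
diff = df1.loc[shared].compare(df2.loc[shared])
print(diff)
```

This only reports value changes for rows present in both files; the rows unique to either side still need the merge-based approach.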

Keeping Equal Values for Context

By default, df.compare() only shows cells that differ, which can make it hard to understand the surrounding context. You can include all values by passing keep_equal=True and keep_shape=True:

diff = df1.compare(df2, keep_equal=True, keep_shape=True)
print(diff)

This produces a full-width output where every cell is shown, making it easier to see the changed values alongside the unchanged ones.
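A quick illustration with small inline frames (stand-ins for the order files above):

```python
import pandas as pd

df1 = pd.DataFrame({"amounts": [150, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))
df2 = pd.DataFrame({"amounts": [175, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))

full = df1.compare(df2, keep_equal=True, keep_shape=True)

# The unchanged 1002 row is retained, with 50 shown on both sides
print(full)
```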

Finding Added or Removed Rows

When rows have been inserted or deleted between versions, cell-level comparison is not enough. Instead, use pd.merge() with the indicator parameter, which adds a column telling you where each row came from:

import pandas as pd

df1 = pd.read_csv("orders_old.csv")
df2 = pd.read_csv("orders_new.csv")

# Outer merge with an indicator column
merged = pd.merge(
    df1, df2,
    how="outer",
    indicator=True,
    suffixes=("_old", "_new"),
)

# Categorize the rows
added = merged[merged["_merge"] == "right_only"]
deleted = merged[merged["_merge"] == "left_only"]
common = merged[merged["_merge"] == "both"]

print(f"Added rows: {len(added)}")
print(f"Deleted rows: {len(deleted)}")
print(f"Common rows: {len(common)}")

Output:

Added rows: 2
Deleted rows: 2
Common rows: 1

The _merge column contains one of three values:

  • "left_only" means the row exists only in the first file (deleted).
  • "right_only" means the row exists only in the second file (added).
  • "both" means the row exists in both files.

Note that this merge matches rows on every column, so a row whose values changed (such as order 1001) no longer matches its old version and is counted once as "left_only" and once as "right_only". That is why the output reports two added and two deleted rows even though no order IDs were added or removed. To treat such rows as modifications instead, merge on a key column, as described next.

Merging on Specific Key Columns

If your files do not have identical columns or you want to match rows based on specific identifiers, pass the on parameter:

merged = pd.merge(
    df1, df2,
    on=["customer_id", "email"],
    how="outer",
    indicator=True,
)

This matches rows only when both customer_id and email are the same across the two files.
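As a sketch with made-up customer records (all values here are illustrative), a row that matches on both keys is categorized as "both" even when its other columns changed:

```python
import pandas as pd

df1 = pd.DataFrame({"customer_id": [1, 2],
                    "email": ["a@example.com", "b@example.com"],
                    "plan": ["free", "pro"]})
df2 = pd.DataFrame({"customer_id": [2, 3],
                    "email": ["b@example.com", "c@example.com"],
                    "plan": ["enterprise", "free"]})

merged = pd.merge(df1, df2, on=["customer_id", "email"],
                  how="outer", indicator=True, suffixes=("_old", "_new"))

# Customer 2 matches on both keys even though the plan changed;
# customers 1 and 3 come out as left_only / right_only respectively
print(merged[["customer_id", "plan_old", "plan_new", "_merge"]])
```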

Building a Complete Diff Report

For a comprehensive comparison that captures added rows, deleted rows, and modified cell values in a single report, you can combine both techniques into a reusable function:

import pandas as pd


def compare_csv_files(file1, file2, key_column):
    """Generate a complete comparison report between two CSV files."""
    df1 = pd.read_csv(file1)
    df2 = pd.read_csv(file2)

    # Find added and deleted rows via outer merge
    merged = pd.merge(
        df1, df2,
        on=key_column,
        how="outer",
        indicator=True,
        suffixes=("_old", "_new"),
    )

    report = {
        "added": merged[merged["_merge"] == "right_only"][key_column].tolist(),
        "deleted": merged[merged["_merge"] == "left_only"][key_column].tolist(),
        "modified": [],
    }

    # Check common rows for cell-level modifications
    common = merged[merged["_merge"] == "both"]

    for col in df1.columns:
        if col == key_column:
            continue
        old_col = f"{col}_old"
        new_col = f"{col}_new"
        if old_col in common.columns and new_col in common.columns:
            # A plain != would also flag NaN -> NaN as a change
            # (NaN never equals NaN), so mask out both-missing pairs
            differs = (common[old_col] != common[new_col]) & ~(
                common[old_col].isna() & common[new_col].isna()
            )
            for _, row in common[differs].iterrows():
                report["modified"].append({
                    "key": row[key_column],
                    "column": col,
                    "old_value": row[old_col],
                    "new_value": row[new_col],
                })

    return report


# Usage
report = compare_csv_files("orders_old.csv", "orders_new.csv", "order_id")
print(f"Added: {report['added']}")
print(f"Deleted: {report['deleted']}")
print(f"Modified cells: {len(report['modified'])}")

for change in report["modified"]:
    print(f"  Row {change['key']}, column '{change['column']}': "
          f"{change['old_value']} -> {change['new_value']}")

Output:

Added: []
Deleted: []
Modified cells: 3
  Row 1001, column 'amounts': 150 -> 175.0
  Row 1003, column 'amounts': 100 -> nan
  Row 1001, column 'status': shipped -> nan

Ignoring Specific Columns

Metadata columns like timestamps or export sequence numbers often differ between files but are not meaningful for comparison. Drop them before comparing:

import pandas as pd

df1 = pd.read_csv("orders_old.csv", index_col="order_id")
df2 = pd.read_csv("orders_new.csv", index_col="order_id")

# Drop irrelevant columns before comparison
ignore_cols = ["updated_at", "export_timestamp"]
df1_clean = df1.drop(columns=ignore_cols, errors="ignore")
df2_clean = df2.drop(columns=ignore_cols, errors="ignore")

diff = df1_clean.compare(df2_clean)
print(diff)

Output:

         amounts          status      
            self  other     self other
order_id                              
1001       150.0  175.0  shipped   NaN
1003       100.0    NaN      NaN   NaN

Because neither ignored column exists in the sample order files, the result matches the earlier comparison; with real exports, the timestamp columns would simply be excluded from the diff.

note

The errors="ignore" parameter ensures the code does not break if a column in the ignore list does not exist in one of the files.
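A small sketch of that behavior, using an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"amounts": [150], "updated_at": ["2024-01-01"]})

# "export_timestamp" is absent, but errors="ignore" skips it silently
clean = df.drop(columns=["updated_at", "export_timestamp"], errors="ignore")
print(list(clean.columns))  # ['amounts']
```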

Handling Data Type Mismatches

CSV files often represent the same data differently. One file might store a number as "1.0" while another stores it as "1". String columns might have inconsistent whitespace. Normalizing both DataFrames before comparison prevents false positives:

import pandas as pd


def normalize_for_comparison(df):
    """Normalize data types and formatting for fair comparison."""
    df = df.copy()

    for col in df.columns:
        # Try converting string columns to numeric
        if df[col].dtype == "object":
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                # Not numeric; strip whitespace instead
                df[col] = df[col].str.strip()

    return df


df1 = normalize_for_comparison(pd.read_csv("file1.csv", index_col="id"))
df2 = normalize_for_comparison(pd.read_csv("file2.csv", index_col="id"))

diff = df1.compare(df2)
print(diff)


Without normalization, a value of "100.0" in one file and "100" in the other would be flagged as a difference even though they represent the same number.
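A minimal sketch of that false positive, assuming two columns read in as strings:

```python
import pandas as pd

s1 = pd.Series(["100.0", "50"])
s2 = pd.Series(["100", "50"])

# Compared as raw strings, "100.0" and "100" look different
print((s1 == s2).tolist())  # [False, True]

# After numeric normalization they compare equal
print((pd.to_numeric(s1) == pd.to_numeric(s2)).tolist())  # [True, True]
```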

caution

Be careful when normalizing data types automatically. Converting "001" to a numeric value turns it into 1, which might lose meaningful leading zeros (for example, in ZIP codes or product codes). Always consider the semantics of your data before applying blanket normalization.
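One guard (sketched with an inline CSV) is to pin code-like columns to strings at read time via the dtype parameter, so they never pass through numeric normalization:

```python
import pandas as pd
from io import StringIO

csv_text = "order_id,zip\n1001,00501\n1002,02134\n"

# Without the dtype hint, read_csv would parse zip as an integer
# and silently drop the leading zeros (00501 -> 501)
df = pd.read_csv(StringIO(csv_text), dtype={"zip": str})
print(df["zip"].tolist())  # ['00501', '02134']
```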

Handling Large Files with the csv Module

When files are too large to fit into memory, Pandas is not practical. The built-in csv module lets you compare files without loading everything at once.

Comparing Row Keys

The simplest large-file comparison checks which keys exist in each file using sets:

import csv


def get_row_keys(filename, key_column=0):
    """Extract unique keys from a CSV without loading the entire file."""
    keys = set()
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row
        for row in reader:
            keys.add(row[key_column])
    return keys


def compare_large_csvs(file1, file2, key_column=0):
    """Compare two large CSVs by key column."""
    keys1 = get_row_keys(file1, key_column)
    keys2 = get_row_keys(file2, key_column)

    return {
        "added": keys2 - keys1,
        "deleted": keys1 - keys2,
        "common": keys1 & keys2,
    }


result = compare_large_csvs("big_file_v1.csv", "big_file_v2.csv")
print(f"New records: {len(result['added'])}")
print(f"Removed records: {len(result['deleted'])}")
print(f"Common records: {len(result['common'])}")

Output:

New records: 1523
Removed records: 87
Common records: 248390

This approach only stores the key values in memory, not the full rows, making it suitable for files with millions of records.

Chunked Processing for Value Comparison

When you need to compare actual cell values in large files (not just keys), process the data in chunks so that only a small portion of each file is in memory at any given time:

import pandas as pd

chunk_size = 10_000
diffs = []

for chunk1, chunk2 in zip(
    pd.read_csv("file1.csv", chunksize=chunk_size),
    pd.read_csv("file2.csv", chunksize=chunk_size),
):
    diff = chunk1.compare(chunk2)
    if not diff.empty:
        diffs.append(diff)

if diffs:
    all_diffs = pd.concat(diffs)
    print(f"Total differences found: {len(all_diffs)}")
    print(all_diffs)
else:
    print("Files are identical.")

tip

Chunked processing assumes both files have the same row order and the same number of rows. If rows have been added, deleted, or reordered between versions, use the key-based approach instead and compare matching rows individually.
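One way to sketch that key-based, memory-light value comparison with the csv module is to keep only a hash of each row from the first file, keyed by its identifier, and stream the second file against it (the helper and file names here are illustrative, not from the original):

```python
import csv
import hashlib


def row_digest(row):
    """Stable fingerprint of a row's values."""
    return hashlib.sha256("\x1f".join(row).encode()).hexdigest()


def find_changed_keys(file1, file2, key_column=0):
    """Stream both files; hold only key -> hash in memory."""
    digests = {}
    with open(file1, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            digests[row[key_column]] = row_digest(row)

    changed = []
    with open(file2, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            key = row[key_column]
            if key in digests and digests[key] != row_digest(row):
                changed.append(key)
    return changed
```

Rows present in only one file are ignored here; combine this with the set-based key comparison above to report additions and deletions as well.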

Summary

Goal                          Method                        Best For
Cell-level changes            df.compare()                  Files with identical structure, finding value modifications
Row additions and deletions   pd.merge(indicator=True)      Tracking which records were added or removed
Complete diff report          Combined merge and compare    Comprehensive audit of all changes
Large file key comparison     csv module with sets          Memory-constrained environments
Large file value comparison   Chunked pd.read_csv()         Comparing values without loading full files

For most analytical and data validation work, Pandas provides the most robust and readable comparison tools, with proper handling of data types, missing values, and edge cases. Use the csv module approach only when memory constraints prevent loading the full datasets, or when you need to compare files with millions of rows and only care about key-level differences.