
How to Compare Two CSV Files in Python

Comparing CSV files is a common task in data validation, ETL pipeline testing, database migration verification, and tracking changes between different versions of a dataset. Depending on what you need to find, the comparison might involve detecting changed cell values, identifying added or removed rows, or verifying overall data integrity.

This guide covers practical comparison techniques using Pandas for analytical workflows and the built-in csv module for memory-efficient handling of large files.

Comparing Cell Values with df.compare()

When two CSV files share the same structure (identical rows and columns), the df.compare() method generates a clear diff report showing exactly which cells changed between the two versions.

Consider these two CSV files, which describe the same three orders but differ in a few cell values:

orders_old.csv
order_id,amounts,status
1001,150,shipped
1002,50,pending
1003,100,
orders_new.csv
order_id,amounts,status
1001,175,
1002,50,pending
1003,,
import pandas as pd

# Load CSVs with a common key as the index
df1 = pd.read_csv("orders_old.csv", index_col="order_id")
df2 = pd.read_csv("orders_new.csv", index_col="order_id")

# Compare the two DataFrames
diff = df1.compare(df2)
print(diff)

Example output:

         amounts          status      
            self  other     self other
order_id                              
1001       150.0  175.0  shipped   NaN
1003       100.0    NaN      NaN   NaN

The self column shows the value from the first DataFrame, and other shows the value from the second. Rows and columns with no differences are dropped entirely, and any remaining cell that did not change is displayed as NaN.
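Because the result uses a two-level column index of (column, "self"/"other"), individual changes can also be pulled out programmatically. A minimal sketch, using small inline frames as stand-ins for the CSV files:

```python
import pandas as pd

df1 = pd.DataFrame({"amounts": [150, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))
df2 = pd.DataFrame({"amounts": [175, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))

diff = df1.compare(df2)

# Select the new value for every changed cell in "amounts"
print(diff[("amounts", "other")].to_dict())
```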

note

df.compare() requires both DataFrames to have the same shape and index. If your files have different numbers of rows, you will need the merge-based approach covered in the next section.
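If the files differ only by a few rows, one workaround (sketched here with inline frames in place of files) is to restrict both DataFrames to the index keys they share before comparing:

```python
import pandas as pd

df1 = pd.DataFrame({"amounts": [150, 50, 100]},
                   index=pd.Index([1001, 1002, 1003], name="order_id"))
df2 = pd.DataFrame({"amounts": [175, 50, 100, 20]},
                   index=pd.Index([1001, 1002, 1003, 1004], name="order_id"))

# df1.compare(df2) would raise here because the shapes differ;
# aligning both frames on the shared keys makes the comparison valid
shared = df1.index.intersection(df2.index)
diff = df1.loc[shared].compare(df2.loc[shared])
print(diff)
```

This only reports value changes for rows present in both files; the rows unique to either side still need the merge-based approach.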

Keeping Equal Values for Context

By default, df.compare() only shows cells that differ, which can make it hard to understand the surrounding context. You can include all values by passing keep_equal=True and keep_shape=True:

diff = df1.compare(df2, keep_equal=True, keep_shape=True)
print(diff)

This produces a full-width output where every cell is shown, making it easier to see the changed values alongside the unchanged ones.
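A quick illustration with small inline frames (stand-ins for the order files above):

```python
import pandas as pd

df1 = pd.DataFrame({"amounts": [150, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))
df2 = pd.DataFrame({"amounts": [175, 50]},
                   index=pd.Index([1001, 1002], name="order_id"))

full = df1.compare(df2, keep_equal=True, keep_shape=True)

# The unchanged 1002 row is retained, with 50 shown on both sides
print(full)
```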

Finding Added or Removed Rows

When rows have been inserted or deleted between versions, cell-level comparison is not enough. Instead, use pd.merge() with the indicator parameter, which adds a column telling you where each row came from:

import pandas as pd

df1 = pd.read_csv("orders_old.csv")
df2 = pd.read_csv("orders_new.csv")

# Outer merge with an indicator column
merged = pd.merge(
    df1, df2,
    how="outer",
    indicator=True,
    suffixes=("_old", "_new"),
)

# Categorize the rows
added = merged[merged["_merge"] == "right_only"]
deleted = merged[merged["_merge"] == "left_only"]
common = merged[merged["_merge"] == "both"]

print(f"Added rows: {len(added)}")
print(f"Deleted rows: {len(deleted)}")
print(f"Common rows: {len(common)}")

Output:

Added rows: 2
Deleted rows: 2
Common rows: 1

The _merge column contains one of three values:

  • "left_only" means the row exists only in the first file (deleted).
  • "right_only" means the row exists only in the second file (added).
  • "both" means the row exists in both files.

Note that this merge matches rows on every column, so a row whose values changed (such as order 1001) no longer matches its old version and is counted once as "left_only" and once as "right_only". That is why the output reports two added and two deleted rows even though no order IDs were added or removed. To treat such rows as modifications instead, merge on a key column, as described next.

Merging on Specific Key Columns

If your files do not have identical columns or you want to match rows based on specific identifiers, pass the on parameter:

merged = pd.merge(
    df1, df2,
    on=["customer_id", "email"],
    how="outer",
    indicator=True,
)

This matches rows only when both customer_id and email are the same across the two files.
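As a sketch with made-up customer records (all values here are illustrative), a row that matches on both keys is categorized as "both" even when its other columns changed:

```python
import pandas as pd

df1 = pd.DataFrame({"customer_id": [1, 2],
                    "email": ["a@example.com", "b@example.com"],
                    "plan": ["free", "pro"]})
df2 = pd.DataFrame({"customer_id": [2, 3],
                    "email": ["b@example.com", "c@example.com"],
                    "plan": ["enterprise", "free"]})

merged = pd.merge(df1, df2, on=["customer_id", "email"],
                  how="outer", indicator=True, suffixes=("_old", "_new"))

# Customer 2 matches on both keys even though the plan changed;
# customers 1 and 3 come out as left_only / right_only respectively
print(merged[["customer_id", "plan_old", "plan_new", "_merge"]])
```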

Building a Complete Diff Report

For a comprehensive comparison that captures added rows, deleted rows, and modified cell values in a single report, you can combine both techniques into a reusable function:

import pandas as pd


def compare_csv_files(file1, file2, key_column):
    """Generate a complete comparison report between two CSV files."""
    df1 = pd.read_csv(file1)
    df2 = pd.read_csv(file2)

    # Find added and deleted rows via outer merge
    merged = pd.merge(
        df1, df2,
        on=key_column,
        how="outer",
        indicator=True,
        suffixes=("_old", "_new"),
    )

    report = {
        "added": merged[merged["_merge"] == "right_only"][key_column].tolist(),
        "deleted": merged[merged["_merge"] == "left_only"][key_column].tolist(),
        "modified": [],
    }

    # Check common rows for cell-level modifications
    common = merged[merged["_merge"] == "both"]

    for col in df1.columns:
        if col == key_column:
            continue
        old_col = f"{col}_old"
        new_col = f"{col}_new"
        if old_col in common.columns and new_col in common.columns:
            # A plain != would also flag NaN -> NaN as a change
            # (NaN never equals NaN), so mask out both-missing pairs
            differs = (common[old_col] != common[new_col]) & ~(
                common[old_col].isna() & common[new_col].isna()
            )
            for _, row in common[differs].iterrows():
                report["modified"].append({
                    "key": row[key_column],
                    "column": col,
                    "old_value": row[old_col],
                    "new_value": row[new_col],
                })

    return report


# Usage
report = compare_csv_files("orders_old.csv", "orders_new.csv", "order_id")
print(f"Added: {report['added']}")
print(f"Deleted: {report['deleted']}")
print(f"Modified cells: {len(report['modified'])}")

for change in report["modified"]:
    print(f"  Row {change['key']}, column '{change['column']}': "
          f"{change['old_value']} -> {change['new_value']}")

Output:

Added: []
Deleted: []
Modified cells: 3
  Row 1001, column 'amounts': 150 -> 175.0
  Row 1003, column 'amounts': 100 -> nan
  Row 1001, column 'status': shipped -> nan

Ignoring Specific Columns

Metadata columns like timestamps or export sequence numbers often differ between files but are not meaningful for comparison. Drop them before comparing:

import pandas as pd

df1 = pd.read_csv("orders_old.csv", index_col="order_id")
df2 = pd.read_csv("orders_new.csv", index_col="order_id")

# Drop irrelevant columns before comparison
ignore_cols = ["updated_at", "export_timestamp"]
df1_clean = df1.drop(columns=ignore_cols, errors="ignore")
df2_clean = df2.drop(columns=ignore_cols, errors="ignore")

diff = df1_clean.compare(df2_clean)
print(diff)

Output:

         amounts          status      
            self  other     self other
order_id                              
1001       150.0  175.0  shipped   NaN
1003       100.0    NaN      NaN   NaN

Because neither ignored column exists in the sample order files, the result matches the earlier comparison; with real exports, the timestamp columns would simply be excluded from the diff.

note

The errors="ignore" parameter ensures the code does not break if a column in the ignore list does not exist in one of the files.
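A small sketch of that behavior, using an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"amounts": [150], "updated_at": ["2024-01-01"]})

# "export_timestamp" is absent, but errors="ignore" skips it silently
clean = df.drop(columns=["updated_at", "export_timestamp"], errors="ignore")
print(list(clean.columns))  # ['amounts']
```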

Handling Data Type Mismatches

CSV files often represent the same data differently. One file might store a number as "1.0" while another stores it as "1". String columns might have inconsistent whitespace. Normalizing both DataFrames before comparison prevents false positives:

import pandas as pd


def normalize_for_comparison(df):
    """Normalize data types and formatting for fair comparison."""
    df = df.copy()

    for col in df.columns:
        # Try converting string columns to numeric
        if df[col].dtype == "object":
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                # Not numeric; strip whitespace instead
                df[col] = df[col].str.strip()

    return df


df1 = normalize_for_comparison(pd.read_csv("file1.csv", index_col="id"))
df2 = normalize_for_comparison(pd.read_csv("file2.csv", index_col="id"))

diff = df1.compare(df2)
print(diff)


Without normalization, a value of "100.0" in one file and "100" in the other would be flagged as a difference even though they represent the same number.
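A minimal sketch of that false positive, assuming two columns read in as strings:

```python
import pandas as pd

s1 = pd.Series(["100.0", "50"])
s2 = pd.Series(["100", "50"])

# Compared as raw strings, "100.0" and "100" look different
print((s1 == s2).tolist())  # [False, True]

# After numeric normalization they compare equal
print((pd.to_numeric(s1) == pd.to_numeric(s2)).tolist())  # [True, True]
```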

caution

Be careful when normalizing data types automatically. Converting "001" to a numeric value turns it into 1, which might lose meaningful leading zeros (for example, in ZIP codes or product codes). Always consider the semantics of your data before applying blanket normalization.
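One guard (sketched with an inline CSV) is to pin code-like columns to strings at read time via the dtype parameter, so they never pass through numeric normalization:

```python
import pandas as pd
from io import StringIO

csv_text = "order_id,zip\n1001,00501\n1002,02134\n"

# Without the dtype hint, read_csv would parse zip as an integer
# and silently drop the leading zeros (00501 -> 501)
df = pd.read_csv(StringIO(csv_text), dtype={"zip": str})
print(df["zip"].tolist())  # ['00501', '02134']
```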

Handling Large Files with the csv Module

When files are too large to fit into memory, Pandas is not practical. The built-in csv module lets you compare files without loading everything at once.

Comparing Row Keys

The simplest large-file comparison checks which keys exist in each file using sets:

import csv


def get_row_keys(filename, key_column=0):
    """Extract unique keys from a CSV without loading the entire file."""
    keys = set()
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row
        for row in reader:
            keys.add(row[key_column])
    return keys


def compare_large_csvs(file1, file2, key_column=0):
    """Compare two large CSVs by key column."""
    keys1 = get_row_keys(file1, key_column)
    keys2 = get_row_keys(file2, key_column)

    return {
        "added": keys2 - keys1,
        "deleted": keys1 - keys2,
        "common": keys1 & keys2,
    }


result = compare_large_csvs("big_file_v1.csv", "big_file_v2.csv")
print(f"New records: {len(result['added'])}")
print(f"Removed records: {len(result['deleted'])}")
print(f"Common records: {len(result['common'])}")

Output:

New records: 1523
Removed records: 87
Common records: 248390

This approach only stores the key values in memory, not the full rows, making it suitable for files with millions of records.

Chunked Processing for Value Comparison

When you need to compare actual cell values in large files (not just keys), process the data in chunks so that only a small portion of each file is in memory at any given time:

import pandas as pd

chunk_size = 10_000
diffs = []

for chunk1, chunk2 in zip(
    pd.read_csv("file1.csv", chunksize=chunk_size),
    pd.read_csv("file2.csv", chunksize=chunk_size),
):
    diff = chunk1.compare(chunk2)
    if not diff.empty:
        diffs.append(diff)

if diffs:
    all_diffs = pd.concat(diffs)
    print(f"Total differences found: {len(all_diffs)}")
    print(all_diffs)
else:
    print("Files are identical.")

tip

Chunked processing assumes both files have the same row order and the same number of rows. If rows have been added, deleted, or reordered between versions, use the key-based approach instead and compare matching rows individually.
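One way to sketch that key-based, memory-light value comparison with the csv module is to keep only a hash of each row from the first file, keyed by its identifier, and stream the second file against it (the helper and file names here are illustrative, not from the original):

```python
import csv
import hashlib


def row_digest(row):
    """Stable fingerprint of a row's values."""
    return hashlib.sha256("\x1f".join(row).encode()).hexdigest()


def find_changed_keys(file1, file2, key_column=0):
    """Stream both files; hold only key -> hash in memory."""
    digests = {}
    with open(file1, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            digests[row[key_column]] = row_digest(row)

    changed = []
    with open(file2, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            key = row[key_column]
            if key in digests and digests[key] != row_digest(row):
                changed.append(key)
    return changed
```

Rows present in only one file are ignored here; combine this with the set-based key comparison above to report additions and deletions as well.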

Summary

Goal                          Method                        Best For
Cell-level changes            df.compare()                  Files with identical structure, finding value modifications
Row additions and deletions   pd.merge(indicator=True)      Tracking which records were added or removed
Complete diff report          Combined merge and compare    Comprehensive audit of all changes
Large file key comparison     csv module with sets          Memory-constrained environments
Large file value comparison   Chunked pd.read_csv()         Comparing values without loading full files

For most analytical and data validation work, Pandas provides the most robust and readable comparison tools, with proper handling of data types, missing values, and edge cases. Use the csv module approach only when memory constraints prevent loading the full datasets, or when you need to compare files with millions of rows and only care about key-level differences.