How to Compare Two Files Using Hashing in Python
Comparing files by name or size alone is not reliable enough to confirm that two files are truly identical. Names can differ while contents match, and two completely different files can happen to share the same size. To verify that files have exactly the same content, whether you are checking for duplicates, validating downloads, or verifying backups, you need to compare the actual data inside them.
Cryptographic hashing provides an efficient way to do this. Instead of comparing every byte of two files directly, you generate a fixed-length fingerprint (hash) for each file and compare the fingerprints. If the hashes match, the files are identical for all practical purposes; the odds of two different files producing the same SHA-256 hash are astronomically small.
This guide covers how to implement file hashing correctly in Python, how to optimize the process, and which algorithms to choose.
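To see the fingerprint idea on in-memory data before working with files: SHA-256 always produces a 64-character hex digest, no matter how large the input is.

```python
import hashlib

# Any input, short or long, yields a fixed-length fingerprint
h1 = hashlib.sha256(b"hello").hexdigest()
h2 = hashlib.sha256(b"hello" * 10000).hexdigest()
print(len(h1), len(h2))  # SHA-256 hex digests are always 64 characters
```

This fixed length is what makes hash comparison cheap: two 64-character strings are compared instead of gigabytes of content.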
Why You Should Read Files in Chunks
Before looking at the implementation, it is important to understand a critical mistake that can crash your application. Loading an entire file into memory with a single file.read() call works for small files, but will consume enormous amounts of RAM for large files:
```python
import hashlib

# Wrong approach: loads the entire file into memory at once
def get_file_hash_bad(filepath):
    with open(filepath, "rb") as f:
        data = f.read()  # A 10 GB file consumes 10 GB of RAM
    return hashlib.sha256(data).hexdigest()
```
If you run this against a 10 GB video file, your system may run out of memory and crash. The correct approach is to read the file in small, fixed-size chunks and feed each chunk to the hash function incrementally.
Avoid using file.read() without a size limit. Always read in chunks to handle files of any size safely. The chunked approach uses a constant amount of memory regardless of file size.
The Chunked Hashing Method
The following function reads a file in 64 KB chunks, updates the hash incrementally, and returns the final hex digest:
```python
import hashlib

def get_file_hash(filepath, algorithm="sha256"):
    """Generate a hash of file contents using chunked reading."""
    hasher = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):  # Read in 64 KB chunks
            hasher.update(chunk)
    return hasher.hexdigest()

def compare_files(file1, file2):
    """Return True if both files have identical content."""
    return get_file_hash(file1) == get_file_hash(file2)

# Usage
if compare_files("image1.jpg", "backup.jpg"):
    print("Files are identical")
else:
    print("Files differ")
```

```
Files are identical
```
This approach uses a constant amount of memory (roughly 64 KB) regardless of whether the file is 1 KB or 100 GB.
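If you are on Python 3.11 or later, the standard library can run the chunked loop for you: hashlib.file_digest() reads from a binary file object in chunks internally. A minimal sketch, assuming Python 3.11+ (the name get_file_hash_311 is just an illustrative choice):

```python
import hashlib

def get_file_hash_311(filepath, algorithm="sha256"):
    """Hash a file using hashlib.file_digest (Python 3.11+)."""
    with open(filepath, "rb") as f:
        # file_digest handles the chunked reading internally
        return hashlib.file_digest(f, algorithm).hexdigest()
```

On older Python versions, the manual while-loop shown above remains the standard approach.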
Optimization: Check File Size First
Hashing is a CPU-intensive operation. Before spending processing time computing hashes, you can perform an instant size check. If two files have different sizes, their contents cannot possibly be identical:
```python
import os
import hashlib

def get_file_hash(filepath):
    """Generate SHA-256 hash of a file using chunked reading."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def smart_compare(file1, file2):
    """Compare files with a size check optimization."""
    # Step 1: Quick size comparison (instant, no I/O beyond metadata)
    if os.path.getsize(file1) != os.path.getsize(file2):
        return False
    # Step 2: Full hash comparison (thorough, reads both files)
    return get_file_hash(file1) == get_file_hash(file2)

result = smart_compare("report_v1.pdf", "report_v2.pdf")
print(f"Files are identical: {result}")
```

```
Files are identical: False
```
When comparing many files, group them by size first. Only files with matching sizes need hash computation. This simple optimization can eliminate the majority of comparisons before any file content is read.
Finding Duplicate Files in a Directory
A common real-world task is scanning a directory tree to find duplicate files. The strategy is to first group files by size (an instant metadata check), and then compute hashes only for files that share a size:
```python
import hashlib
from pathlib import Path
from collections import defaultdict

def get_file_hash(filepath):
    """Generate SHA-256 hash of a file."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates(directory):
    """Find all duplicate files in a directory tree."""
    # Step 1: Group files by size
    size_groups = defaultdict(list)
    for path in Path(directory).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            size_groups[size].append(path)

    # Step 2: Hash only files that share a size with at least one other file
    hash_groups = defaultdict(list)
    for size, files in size_groups.items():
        if len(files) < 2:
            continue  # Unique size means unique file, skip hashing
        for filepath in files:
            file_hash = get_file_hash(filepath)
            hash_groups[file_hash].append(filepath)

    # Step 3: Return only groups with actual duplicates
    return {h: paths for h, paths in hash_groups.items() if len(paths) > 1}

# Usage
dupes = find_duplicates("/path/to/photos")
if not dupes:
    print("No duplicates found.")
else:
    for file_hash, paths in dupes.items():
        print(f"\nDuplicate set (hash: {file_hash[:12]}...):")
        for p in paths:
            print(f"  {p}")
```

```
Duplicate set (hash: 3a7bd3e2f1c0...):
  /path/to/photos/vacation/IMG_001.jpg
  /path/to/photos/backup/IMG_001_copy.jpg

Duplicate set (hash: 9f2c18d4a7b3...):
  /path/to/photos/2023/sunset.png
  /path/to/photos/favorites/sunset.png
  /path/to/photos/shared/sunset.png
```
The size-first grouping means that if you have 10,000 files but only 200 share a size with another file, you only compute 200 hashes instead of 10,000.
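A further refinement, not shown in the function above, is a cheap pre-filter: hash only the first few kilobytes of each size-matched file, and compute full hashes only for files whose leading bytes also collide. A hedged sketch (partial_hash is a hypothetical helper name):

```python
import hashlib

def partial_hash(filepath, size=4096):
    """Hash only the first `size` bytes of a file.

    A cheap pre-filter for grouping candidates; it is NOT a
    substitute for a full-content hash, since files can share
    their leading bytes and still differ later.
    """
    with open(filepath, "rb") as f:
        return hashlib.sha256(f.read(size)).hexdigest()
```

For directories full of large files that share sizes (e.g. fixed-size database dumps), this avoids reading most of each file before the full hash is ever needed.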
Choosing the Right Hashing Algorithm
Python's hashlib module provides several hashing algorithms. The right choice depends on whether you need security, speed, or compatibility:
| Algorithm | Speed | Security | Best Used For |
|---|---|---|---|
| MD5 | Fast | Broken (collisions possible) | Legacy compatibility only |
| SHA-1 | Medium | Weak (deprecated) | Avoid for new projects |
| SHA-256 | Medium | Strong | Default choice for most use cases |
| BLAKE2b | Very fast | Strong | High-performance file comparison |
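You can check which algorithm names hashlib accepts on your system. algorithms_guaranteed lists those available on every platform, while algorithms_available includes everything the local OpenSSL build provides:

```python
import hashlib

# Algorithms guaranteed to exist on every Python platform
print(sorted(hashlib.algorithms_guaranteed))

# The full set available on this system (via OpenSSL) is usually larger
print(sorted(hashlib.algorithms_available))
```

Any name from either set can be passed to hashlib.new(), which is why the earlier get_file_hash function takes the algorithm as a string parameter.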
Using BLAKE2b for Better Performance
If you are comparing many files or working with very large files and performance matters, BLAKE2b is an excellent alternative. It is faster than SHA-256 while providing the same level of security:
```python
import hashlib

def get_file_hash_fast(filepath):
    """Use BLAKE2b for faster hashing with strong security."""
    hasher = hashlib.blake2b()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

hash1 = get_file_hash_fast("large_video.mp4")
print(f"BLAKE2b hash: {hash1[:24]}...")
```

```
BLAKE2b hash: 4a7f3c9e1b2d8a5f6c0e7d...
```
- SHA-256: The safe default. Widely supported, strong security, and good enough performance for most applications.
- BLAKE2b: Choose this when processing large files or running many comparisons where speed matters. It has been part of Python's standard library since Python 3.6.
- MD5: Use only for legacy compatibility or non-security contexts where you need to match an existing MD5 checksum. Never rely on MD5 for security-critical integrity checks.
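One convenience of BLAKE2b worth knowing: it supports a configurable digest size (1 to 64 bytes), which gives you shorter fingerprints for display or storage without switching algorithms. A small illustration:

```python
import hashlib

# Default BLAKE2b digest is 64 bytes (128 hex characters)
full = hashlib.blake2b(b"example data").hexdigest()

# A 16-byte digest yields a compact 32-character fingerprint
short = hashlib.blake2b(b"example data", digest_size=16).hexdigest()

print(len(full), len(short))  # 128 32
```

A shorter digest weakens collision resistance proportionally, so keep the default size when the hash is used for integrity guarantees rather than display.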
Verifying Downloads Against a Known Hash
A practical use case for file hashing is verifying that a downloaded file matches the checksum published by its author. This confirms the file was not corrupted during transfer or tampered with:
```python
import hashlib

def get_file_hash(filepath, algorithm="sha256"):
    """Generate a hash of file contents using chunked reading."""
    hasher = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def verify_download(filepath, expected_hash, algorithm="sha256"):
    """Verify that a file matches an expected hash value."""
    actual_hash = get_file_hash(filepath, algorithm)
    if actual_hash.lower() == expected_hash.lower():
        print(f"Verified: {filepath}")
        return True
    else:
        print(f"Hash mismatch for {filepath}")
        print(f"  Expected: {expected_hash}")
        print(f"  Actual:   {actual_hash}")
        return False

# Usage
verify_download(
    "python-3.12.0.tar.xz",
    "795c34f44df45a0e9b9a3c99d77fb61d8eb3c857d8c8f1f5b0c3c3e09e5b3a8f",
)
```

```
Verified: python-3.12.0.tar.xz
```
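In security-sensitive contexts, comparing hashes with == can in principle leak timing information about where the strings first differ. The standard library's hmac.compare_digest() performs a constant-time comparison; a hedged variant of the check above, operating on in-memory data for brevity (verify_bytes is an illustrative name):

```python
import hashlib
import hmac

def verify_bytes(data, expected_hash):
    """Constant-time hash comparison using hmac.compare_digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest takes time independent of where the inputs differ
    return hmac.compare_digest(actual.lower(), expected_hash.lower())
```

For offline checks of local files the == comparison is fine; constant-time comparison matters mainly when an attacker can observe response timing, such as in a web service.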
Handling Errors Gracefully
Production code should account for common file access problems such as missing files, permission issues, and corrupted paths:
```python
import hashlib
import os

def get_file_hash(filepath):
    """Generate SHA-256 hash of a file."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def safe_compare(file1, file2):
    """Compare files with comprehensive error handling."""
    try:
        if not os.path.isfile(file1):
            raise FileNotFoundError(f"File not found: {file1}")
        if not os.path.isfile(file2):
            raise FileNotFoundError(f"File not found: {file2}")
        # Quick size check
        if os.path.getsize(file1) != os.path.getsize(file2):
            return False
        # Full hash comparison
        return get_file_hash(file1) == get_file_hash(file2)
    except PermissionError as e:
        print(f"Permission denied: {e}")
        raise
    except OSError as e:
        print(f"Error accessing file: {e}")
        raise

# Usage
try:
    result = safe_compare("data.bin", "data_backup.bin")
    print(f"Files match: {result}")
except FileNotFoundError as e:
    print(e)
```

```
File not found: data_backup.bin
```
Summary
Comparing files using hashing is a two-step process that balances speed and thoroughness:
- Size check using os.path.getsize() for instant rejection of files that cannot possibly match.
- Chunked hash comparison using hashlib for bit-perfect content verification without memory issues.
| Step | Method | Purpose |
|---|---|---|
| Quick check | os.path.getsize() | Eliminate obvious mismatches instantly |
| Thorough check | Chunked SHA-256 or BLAKE2b | Verify identical content byte by byte |
| Batch deduplication | Size grouping then hashing | Minimize hash computations across many files |
Use SHA-256 as your default algorithm for its strong security and broad compatibility. Switch to BLAKE2b when processing large files or running many comparisons where speed is a priority. Regardless of which algorithm you choose, always read files in chunks to ensure your code works reliably with files of any size.