How to Compare Two Files Using Hashing in Python
Comparing files by name or size alone is not reliable enough to confirm that two files are truly identical. Names can differ while contents match, and two completely different files can happen to share the same size. To verify that files have exactly the same content, whether you are checking for duplicates, validating downloads, or verifying backups, you need to compare the actual data inside them.
Cryptographic hashing provides an efficient way to do this. Instead of comparing every byte of two files directly, you generate a fixed-length fingerprint (hash) for each file and compare the fingerprints. If the hashes match, the files are identical for all practical purposes; the odds of two different files producing the same SHA-256 hash are astronomically small.
This guide covers how to implement file hashing correctly in Python, how to optimize the process, and which algorithms to choose.
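To see the fingerprint idea on in-memory data before working with files: SHA-256 always produces a 64-character hex digest, no matter how large the input is.

```python
import hashlib

# Any input, short or long, yields a fixed-length fingerprint
h1 = hashlib.sha256(b"hello").hexdigest()
h2 = hashlib.sha256(b"hello" * 10000).hexdigest()
print(len(h1), len(h2))  # SHA-256 hex digests are always 64 characters
```

This fixed length is what makes hash comparison cheap: two 64-character strings are compared instead of gigabytes of content.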
Why You Should Read Files in Chunks
Before looking at the implementation, it is important to understand a critical mistake that can crash your application. Loading an entire file into memory with a single file.read() call works for small files, but will consume enormous amounts of RAM for large files:
```python
import hashlib

# Wrong approach: loads the entire file into memory at once
def get_file_hash_bad(filepath):
    with open(filepath, "rb") as f:
        data = f.read()  # A 10 GB file consumes 10 GB of RAM
    return hashlib.sha256(data).hexdigest()
```
If you run this against a 10 GB video file, your system may run out of memory and crash. The correct approach is to read the file in small, fixed-size chunks and feed each chunk to the hash function incrementally.
Avoid using file.read() without a size limit. Always read in chunks to handle files of any size safely. The chunked approach uses a constant amount of memory regardless of file size.
The Chunked Hashing Method
The following function reads a file in 64 KB chunks, updates the hash incrementally, and returns the final hex digest:
```python
import hashlib

def get_file_hash(filepath, algorithm="sha256"):
    """Generate a hash of file contents using chunked reading."""
    hasher = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):  # Read in 64 KB chunks
            hasher.update(chunk)
    return hasher.hexdigest()

def compare_files(file1, file2):
    """Return True if both files have identical content."""
    return get_file_hash(file1) == get_file_hash(file2)

# Usage
if compare_files("image1.jpg", "backup.jpg"):
    print("Files are identical")
else:
    print("Files differ")
```

```
Files are identical
```
This approach uses a constant amount of memory (roughly 64 KB) regardless of whether the file is 1 KB or 100 GB.
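If you are on Python 3.11 or later, the standard library can run the chunked loop for you: hashlib.file_digest() reads from a binary file object in chunks internally. A minimal sketch, assuming Python 3.11+ (the name get_file_hash_311 is just an illustrative choice):

```python
import hashlib

def get_file_hash_311(filepath, algorithm="sha256"):
    """Hash a file using hashlib.file_digest (Python 3.11+)."""
    with open(filepath, "rb") as f:
        # file_digest handles the chunked reading internally
        return hashlib.file_digest(f, algorithm).hexdigest()
```

On older Python versions, the manual while-loop shown above remains the standard approach.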
Optimization: Check File Size First
Hashing is a CPU-intensive operation. Before spending processing time computing hashes, you can perform an instant size check. If two files have different sizes, their contents cannot possibly be identical:
```python
import os
import hashlib

def get_file_hash(filepath):
    """Generate SHA-256 hash of a file using chunked reading."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def smart_compare(file1, file2):
    """Compare files with a size check optimization."""
    # Step 1: Quick size comparison (instant, no I/O beyond metadata)
    if os.path.getsize(file1) != os.path.getsize(file2):
        return False
    # Step 2: Full hash comparison (thorough, reads both files)
    return get_file_hash(file1) == get_file_hash(file2)

result = smart_compare("report_v1.pdf", "report_v2.pdf")
print(f"Files are identical: {result}")
```

```
Files are identical: False
```
When comparing many files, group them by size first. Only files with matching sizes need hash computation. This simple optimization can eliminate the majority of comparisons before any file content is read.
Finding Duplicate Files in a Directory
A common real-world task is scanning a directory tree to find duplicate files. The strategy is to first group files by size (an instant metadata check), and then compute hashes only for files that share a size:
```python
import hashlib
from pathlib import Path
from collections import defaultdict

def get_file_hash(filepath):
    """Generate SHA-256 hash of a file."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates(directory):
    """Find all duplicate files in a directory tree."""
    # Step 1: Group files by size
    size_groups = defaultdict(list)
    for path in Path(directory).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            size_groups[size].append(path)

    # Step 2: Hash only files that share a size with at least one other file
    hash_groups = defaultdict(list)
    for size, files in size_groups.items():
        if len(files) < 2:
            continue  # Unique size means unique file, skip hashing
        for filepath in files:
            file_hash = get_file_hash(filepath)
            hash_groups[file_hash].append(filepath)

    # Step 3: Return only groups with actual duplicates
    return {h: paths for h, paths in hash_groups.items() if len(paths) > 1}

# Usage
dupes = find_duplicates("/path/to/photos")
if not dupes:
    print("No duplicates found.")
else:
    for file_hash, paths in dupes.items():
        print(f"\nDuplicate set (hash: {file_hash[:12]}...):")
        for p in paths:
            print(f"  {p}")
```

```
Duplicate set (hash: 3a7bd3e2f1c0...):
  /path/to/photos/vacation/IMG_001.jpg
  /path/to/photos/backup/IMG_001_copy.jpg

Duplicate set (hash: 9f2c18d4a7b3...):
  /path/to/photos/2023/sunset.png
  /path/to/photos/favorites/sunset.png
  /path/to/photos/shared/sunset.png
```
The size-first grouping means that if you have 10,000 files but only 200 share a size with another file, you only compute 200 hashes instead of 10,000.
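A further refinement, not shown in the function above, is a cheap pre-filter: hash only the first few kilobytes of each size-matched file, and compute full hashes only for files whose leading bytes also collide. A hedged sketch (partial_hash is a hypothetical helper name):

```python
import hashlib

def partial_hash(filepath, size=4096):
    """Hash only the first `size` bytes of a file.

    A cheap pre-filter for grouping candidates; it is NOT a
    substitute for a full-content hash, since files can share
    their leading bytes and still differ later.
    """
    with open(filepath, "rb") as f:
        return hashlib.sha256(f.read(size)).hexdigest()
```

For directories full of large files that share sizes (e.g. fixed-size database dumps), this avoids reading most of each file before the full hash is ever needed.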
Choosing the Right Hashing Algorithm
Python's hashlib module provides several hashing algorithms. The right choice depends on whether you need security, speed, or compatibility:
| Algorithm | Speed | Security | Best Used For |
|---|---|---|---|
| MD5 | Fast | Broken (collisions possible) | Legacy compatibility only |
| SHA-1 | Medium | Weak (deprecated) | Avoid for new projects |
| SHA-256 | Medium | Strong | Default choice for most use cases |
| BLAKE2b | Very fast | Strong | High-performance file comparison |
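You can check which algorithm names hashlib accepts on your system. algorithms_guaranteed lists those available on every platform, while algorithms_available includes everything the local OpenSSL build provides:

```python
import hashlib

# Algorithms guaranteed to exist on every Python platform
print(sorted(hashlib.algorithms_guaranteed))

# The full set available on this system (via OpenSSL) is usually larger
print(sorted(hashlib.algorithms_available))
```

Any name from either set can be passed to hashlib.new(), which is why the earlier get_file_hash function takes the algorithm as a string parameter.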
Using BLAKE2b for Better Performance
If you are comparing many files or working with very large files and performance matters, BLAKE2b is an excellent alternative. It is faster than SHA-256 while providing the same level of security:
```python
import hashlib

def get_file_hash_fast(filepath):
    """Use BLAKE2b for faster hashing with strong security."""
    hasher = hashlib.blake2b()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

hash1 = get_file_hash_fast("large_video.mp4")
print(f"BLAKE2b hash: {hash1[:24]}...")
```

```
BLAKE2b hash: 4a7f3c9e1b2d8a5f6c0e7d...
```
- SHA-256: The safe default. Widely supported, strong security, and good enough performance for most applications.
- BLAKE2b: Choose this when processing large files or running many comparisons where speed matters. It has been part of Python's standard library since Python 3.6.
- MD5: Use only for legacy compatibility or non-security contexts where you need to match an existing MD5 checksum. Never rely on MD5 for security-critical integrity checks.
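One convenience of BLAKE2b worth knowing: it supports a configurable digest size (1 to 64 bytes), which gives you shorter fingerprints for display or storage without switching algorithms. A small illustration:

```python
import hashlib

# Default BLAKE2b digest is 64 bytes (128 hex characters)
full = hashlib.blake2b(b"example data").hexdigest()

# A 16-byte digest yields a compact 32-character fingerprint
short = hashlib.blake2b(b"example data", digest_size=16).hexdigest()

print(len(full), len(short))  # 128 32
```

A shorter digest weakens collision resistance proportionally, so keep the default size when the hash is used for integrity guarantees rather than display.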
Verifying Downloads Against a Known Hash
A practical use case for file hashing is verifying that a downloaded file matches the checksum published by its author. This confirms the file was not corrupted during transfer or tampered with:
```python
import hashlib

def get_file_hash(filepath, algorithm="sha256"):
    """Generate a hash of file contents using chunked reading."""
    hasher = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def verify_download(filepath, expected_hash, algorithm="sha256"):
    """Verify that a file matches an expected hash value."""
    actual_hash = get_file_hash(filepath, algorithm)
    if actual_hash.lower() == expected_hash.lower():
        print(f"Verified: {filepath}")
        return True
    else:
        print(f"Hash mismatch for {filepath}")
        print(f"  Expected: {expected_hash}")
        print(f"  Actual:   {actual_hash}")
        return False

# Usage
verify_download(
    "python-3.12.0.tar.xz",
    "795c34f44df45a0e9b9a3c99d77fb61d8eb3c857d8c8f1f5b0c3c3e09e5b3a8f",
)
```

```
Verified: python-3.12.0.tar.xz
```
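In security-sensitive contexts, comparing hashes with == can in principle leak timing information about where the strings first differ. The standard library's hmac.compare_digest() performs a constant-time comparison; a hedged variant of the check above, operating on in-memory data for brevity (verify_bytes is an illustrative name):

```python
import hashlib
import hmac

def verify_bytes(data, expected_hash):
    """Constant-time hash comparison using hmac.compare_digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest takes time independent of where the inputs differ
    return hmac.compare_digest(actual.lower(), expected_hash.lower())
```

For offline checks of local files the == comparison is fine; constant-time comparison matters mainly when an attacker can observe response timing, such as in a web service.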
Handling Errors Gracefully
Production code should account for common file access problems such as missing files, permission issues, and corrupted paths:
```python
import hashlib
import os

def get_file_hash(filepath):
    """Generate SHA-256 hash of a file."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(65536):
            hasher.update(chunk)
    return hasher.hexdigest()

def safe_compare(file1, file2):
    """Compare files with comprehensive error handling."""
    try:
        if not os.path.isfile(file1):
            raise FileNotFoundError(f"File not found: {file1}")
        if not os.path.isfile(file2):
            raise FileNotFoundError(f"File not found: {file2}")
        # Quick size check
        if os.path.getsize(file1) != os.path.getsize(file2):
            return False
        # Full hash comparison
        return get_file_hash(file1) == get_file_hash(file2)
    except PermissionError as e:
        print(f"Permission denied: {e}")
        raise
    except OSError as e:
        print(f"Error accessing file: {e}")
        raise

# Usage
try:
    result = safe_compare("data.bin", "data_backup.bin")
    print(f"Files match: {result}")
except FileNotFoundError as e:
    print(e)
```

```
File not found: data_backup.bin
```
Summary
Comparing files using hashing is a two-step process that balances speed and thoroughness:
- Size check using os.path.getsize() for instant rejection of files that cannot possibly match.
- Chunked hash comparison using hashlib for bit-perfect content verification without memory issues.
| Step | Method | Purpose |
|---|---|---|
| Quick check | os.path.getsize() | Eliminate obvious mismatches instantly |
| Thorough check | Chunked SHA-256 or BLAKE2b | Verify identical content byte by byte |
| Batch deduplication | Size grouping then hashing | Minimize hash computations across many files |
Use SHA-256 as your default algorithm for its strong security and broad compatibility. Switch to BLAKE2b when processing large files or running many comparisons where speed is a priority. Regardless of which algorithm you choose, always read files in chunks to ensure your code works reliably with files of any size.