How to Compare Two Text Files in Python

Comparing text files is a common task in programming, whether you're verifying that a data export hasn't changed, checking configuration file differences, validating test outputs, or tracking document revisions. Python provides multiple built-in tools to compare files at different levels of detail, from a simple "same or different" check to a line-by-line diff with exact changes highlighted.

In this guide, you'll learn several methods to compare two text files in Python, each suited to different scenarios and levels of detail.

The examples in the following sections use these two files:

config_1.txt:

# Sample configuration
host = 127.0.0.1
port = 8080
debug = True
max_connections = 100
log_level = INFO

config_2.txt:

# Sample configuration
host = 127.0.0.1
port = 9090
debug = True
max_connections = 100
log_level = DEBUG

Quick Equality Check with filecmp

The filecmp module provides the simplest way to check if two files are identical. It returns a single boolean: True if the files match, False if they don't.

import filecmp

result = filecmp.cmp('config_1.txt', 'config_2.txt', shallow=False)

if result:
    print("Files are identical.")
else:
    print("Files differ.")

Output:

Files differ.

Parameters:

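  • shallow=True (default): Compares the os.stat() signatures of the files (file type, size, modification time) first. If the signatures match, the files are reported as equal without being read; only if they differ are sizes and contents compared. Faster but less reliable.
  • shallow=False: Always compares the actual file contents byte by byte. Slower but accurate.
tip

Always use shallow=False when you need to confirm that file contents are identical. With the default shallow=True, two files that have the same size and modification time are treated as equal even if their contents differ, which can give false positives.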
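
The false positive described above can be reproduced with a small sketch. It assumes two throwaway files (a.txt and b.txt, names chosen only for illustration) written to a temporary directory, with the same length and a modification time forced to the same value using os.utime():

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, 'a.txt')   # hypothetical file names
    path_b = os.path.join(tmp, 'b.txt')

    # Same length, different contents.
    with open(path_a, 'w') as f:
        f.write('port = 8080\n')
    with open(path_b, 'w') as f:
        f.write('port = 9090\n')

    # Force identical access/modification times so the os.stat() signatures match.
    same_time_ns = 1_700_000_000 * 10**9
    os.utime(path_a, ns=(same_time_ns, same_time_ns))
    os.utime(path_b, ns=(same_time_ns, same_time_ns))

    filecmp.clear_cache()
    shallow_result = filecmp.cmp(path_a, path_b)               # signatures match, contents not read
    deep_result = filecmp.cmp(path_a, path_b, shallow=False)   # contents are read

print(f"shallow=True:  {shallow_result}")
print(f"shallow=False: {deep_result}")
```

In this setup the shallow check reports the files as equal while the content check reports a difference.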

Hash-Based Comparison for Large Files

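For large files, computing a cryptographic hash (like SHA-256) of each file in chunks and comparing the hashes is memory-efficient. It is especially useful when you can store a hash and check a file against it later, or when the two files live on different machines. If the hashes match, the files can be treated as identical, since an accidental SHA-256 collision is not a practical concern.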

import hashlib

def file_hash(filepath):
    """Compute SHA-256 hash of a file in memory-efficient chunks."""
    hasher = hashlib.sha256()
    with open(filepath, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(8192) until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

hash1 = file_hash('config_1.txt')
hash2 = file_hash('config_2.txt')

print(f"File 1 hash: {hash1}")
print(f"File 2 hash: {hash2}")
print("Files are identical." if hash1 == hash2 else "Files differ.")

Output:

File 1 hash: 2c411f8c9b74049656bd1e3b32e4c1c33a81ebe8f3e991dd25e5ab96007a4210
File 2 hash: f405f09320712d8d9ef3825845ffb04a1ef8faf64d6a8eaeb20729882ee20c2b
Files differ.

How it works:

  1. The file is read in small chunks (8 KB), keeping memory usage low regardless of file size.
  2. Each chunk updates the hash incrementally.
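  3. The final hash acts as a fingerprint of the file's contents: any change to the file produces a different hash (barring a collision, which is not a practical concern for SHA-256).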
note

Hash comparison tells you whether files differ but not where they differ. Use line-by-line methods for detailed differences.
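
As a quick sanity check of the chunked approach, the following sketch feeds the same bytes to SHA-256 once in a single call and once in 8 KB chunks. The in-memory payload is a stand-in for a real file and is an assumption made for illustration:

```python
import hashlib
import io

data = b"log_level = INFO\n" * 10_000   # stand-in for a file's bytes

# One-shot: hash the whole payload at once.
one_shot = hashlib.sha256(data).hexdigest()

# Chunked: feed the same bytes 8 KB at a time, as file_hash() does with a real file.
hasher = hashlib.sha256()
stream = io.BytesIO(data)
for chunk in iter(lambda: stream.read(8192), b""):
    hasher.update(chunk)
chunked = hasher.hexdigest()

print(one_shot == chunked)
```

Both digests are the same, so reading in chunks changes only the memory footprint, not the result.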

Line-by-Line Comparison

To see exactly which lines differ between two files, compare them line by line. This is the most informative approach for text files.

Streaming Comparison (Memory-Efficient)

This method reads one line at a time from each file, keeping memory usage minimal:

def compare_files_streaming(file1_path, file2_path):
    """Compare two files line by line without loading them entirely into memory."""
    differences = []

    with open(file1_path, 'r') as f1, open(file2_path, 'r') as f2:
        # File objects are iterators, so zip() pulls one line from each at a time
        for line_num, (line1, line2) in enumerate(zip(f1, f2), start=1):
            if line1 != line2:
                differences.append({
                    'line': line_num,
                    'file1': line1.strip(),
                    'file2': line2.strip()
                })

    return differences


diffs = compare_files_streaming('config_1.txt', 'config_2.txt')

if diffs:
    print(f"Found {len(diffs)} difference(s):\n")
    for d in diffs:
        print(f"Line {d['line']}:")
        print(f" File 1: {d['file1']}")
        print(f" File 2: {d['file2']}")
        print()
else:
    print("Files are identical.")

Output:

Found 2 difference(s):

Line 3:
 File 1: port = 8080
 File 2: port = 9090

Line 6:
 File 1: log_level = INFO
 File 2: log_level = DEBUG

Limitation: zip() Stops at the Shorter File

If the files have different numbers of lines, zip() silently ignores the extra lines in the longer file. To catch length differences, use itertools.zip_longest with a sentinel fill value instead:

from itertools import zip_longest

def compare_files_complete(file1_path, file2_path):
    """Compare files line by line, handling different lengths."""
    differences = []

    with open(file1_path, 'r') as f1, open(file2_path, 'r') as f2:
        # A unique object marks "no line here" without clashing with real content
        sentinel = object()
        for line_num, (line1, line2) in enumerate(
            zip_longest(f1, f2, fillvalue=sentinel), start=1
        ):
            if line1 is sentinel:
                differences.append((line_num, "<missing>", line2.strip()))
            elif line2 is sentinel:
                differences.append((line_num, line1.strip(), "<missing>"))
            elif line1 != line2:
                differences.append((line_num, line1.strip(), line2.strip()))

    return differences

This version detects extra lines in either file.
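
The difference between zip() and zip_longest() can be seen in isolation with a small sketch. It uses io.StringIO objects as stand-ins for open files, and the two sample texts are made up for illustration:

```python
import io
from itertools import zip_longest

short_text = "a\nb\n"        # two lines
long_text = "a\nb\nc\n"      # three lines

# zip() stops as soon as the shorter input runs out.
pairs_zip = list(zip(io.StringIO(short_text), io.StringIO(long_text)))

# zip_longest() keeps going and pads the shorter input with fillvalue.
pairs_longest = list(
    zip_longest(io.StringIO(short_text), io.StringIO(long_text), fillvalue=None)
)

print(len(pairs_zip))       # 2
print(len(pairs_longest))   # 3
print(pairs_longest[-1])    # (None, 'c\n')
```

The third pair only appears with zip_longest(), which is why compare_files_complete() can report the extra line.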

Using difflib for Unified Diff Output

Python's difflib module generates professional diff output similar to Unix diff or Git-style diffs. This is the most detailed and human-readable approach.

Unified Diff

import difflib

with open('config_1.txt', 'r') as f1, open('config_2.txt', 'r') as f2:
    # splitlines() drops the trailing newlines, so they are not doubled
    # when the diff lines are joined with '\n' below
    file1_lines = f1.read().splitlines()
    file2_lines = f2.read().splitlines()

diff = difflib.unified_diff(
    file1_lines,
    file2_lines,
    fromfile='config_1.txt',
    tofile='config_2.txt',
    lineterm=''
)

output = '\n'.join(diff)
if output:
    print(output)
else:
    print("Files are identical.")

Output:

--- config_1.txt
+++ config_2.txt
@@ -1,6 +1,6 @@
 # Sample configuration
 host = 127.0.0.1
-port = 8080
+port = 9090
 debug = True
 max_connections = 100
-log_level = INFO
+log_level = DEBUG
note

Lines prefixed with - were removed (from file 1), and lines with + were added (from file 2). Unchanged lines provide context.
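
How much context appears around each change is controlled by the n parameter of difflib.unified_diff() (default 3). The sketch below uses in-memory copies of the two sample files, and hunk_and_context_counts is a helper name invented for this illustration:

```python
import difflib

old = [
    "# Sample configuration", "host = 127.0.0.1", "port = 8080",
    "debug = True", "max_connections = 100", "log_level = INFO",
]
new = [
    "# Sample configuration", "host = 127.0.0.1", "port = 9090",
    "debug = True", "max_connections = 100", "log_level = DEBUG",
]

def hunk_and_context_counts(n):
    """Return (number of @@ hunks, number of context lines) for n lines of context."""
    diff = list(difflib.unified_diff(old, new, lineterm='', n=n))
    hunks = [line for line in diff if line.startswith('@@')]
    context = [line for line in diff if line.startswith(' ')]
    return len(hunks), len(context)

print(hunk_and_context_counts(3))   # (1, 4): one hunk, four unchanged lines shown
print(hunk_and_context_counts(0))   # (2, 0): one hunk per change, no context
```

With n=0 the diff contains only the changed lines, which is convenient when the output is parsed by a script rather than read by a person.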

Side-by-Side HTML Diff

For visual comparison, difflib.HtmlDiff generates an HTML file with side-by-side highlighting:

import difflib

with open('config_1.txt', 'r') as f1, open('config_2.txt', 'r') as f2:
    file1_lines = f1.readlines()
    file2_lines = f2.readlines()

html_diff = difflib.HtmlDiff()
result = html_diff.make_file(
    file1_lines,
    file2_lines,
    fromdesc='config_1.txt',
    todesc='config_2.txt'
)

# make_file() declares UTF-8 in the generated page by default, so write the file as UTF-8
with open('diff_report.html', 'w', encoding='utf-8') as report:
    report.write(result)

print("HTML diff report saved to diff_report.html")

Open the generated diff_report.html in a browser to see a color-coded comparison.
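
For long files, a full side-by-side report can be hard to scan. As a rough sketch, HtmlDiff.make_table() accepts context=True and numlines to show only the changed regions, and it returns just the table markup rather than a whole page. The in-memory line lists below mirror the sample files and are used here only to keep the example self-contained:

```python
import difflib

old_lines = [
    "# Sample configuration", "host = 127.0.0.1", "port = 8080",
    "debug = True", "max_connections = 100", "log_level = INFO",
]
new_lines = [
    "# Sample configuration", "host = 127.0.0.1", "port = 9090",
    "debug = True", "max_connections = 100", "log_level = DEBUG",
]

# context=True limits the output to changed lines plus `numlines` lines around them.
table = difflib.HtmlDiff().make_table(
    old_lines,
    new_lines,
    fromdesc='config_1.txt',
    todesc='config_2.txt',
    context=True,
    numlines=1,
)

print(table.lstrip()[:6])   # the fragment starts with a <table ...> tag
```

The returned fragment can be embedded in an existing HTML page, which then needs to supply its own styling for the diff classes.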

Using readlines() for Simple Comparison

For smaller files, you can load both files entirely into memory and compare line by line:

def compare_with_readlines(file1_path, file2_path):
    """Load both files into memory and compare line by line."""
    with open(file1_path, 'r') as f1, open(file2_path, 'r') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()

    max_lines = max(len(lines1), len(lines2))

    for i in range(max_lines):
        # Past the end of the shorter file, report the line as missing
        line1 = lines1[i].strip() if i < len(lines1) else "<missing>"
        line2 = lines2[i].strip() if i < len(lines2) else "<missing>"

        if line1 == line2:
            print(f"Line {i + 1}: IDENTICAL")
        else:
            print(f"Line {i + 1}: DIFFERENT")
            print(f" File 1: {line1}")
            print(f" File 2: {line2}")


compare_with_readlines('config_1.txt', 'config_2.txt')

Output:

Line 1: IDENTICAL
Line 2: IDENTICAL
Line 3: DIFFERENT
 File 1: port = 8080
 File 2: port = 9090
Line 4: IDENTICAL
Line 5: IDENTICAL
Line 6: DIFFERENT
 File 1: log_level = INFO
 File 2: log_level = DEBUG
note

readlines() loads the entire file into memory, so it's not ideal for very large files. For files over a few hundred megabytes, use the streaming approach or hash comparison instead.

Practical Example: Comparing Configuration Files

A real-world use case is detecting changes in configuration files:

import difflib

def compare_configs(old_path, new_path):
    """Compare two config files and report changes."""
    with open(old_path) as f1, open(new_path) as f2:
        old_lines = f1.readlines()
        new_lines = f2.readlines()

    # n=0 omits context lines, so the diff contains only headers and changed lines
    diff = list(difflib.unified_diff(old_lines, new_lines, n=0))

    # Skip the '+++' / '---' file headers when collecting changed lines
    added = [line[1:].strip() for line in diff if line.startswith('+') and not line.startswith('+++')]
    removed = [line[1:].strip() for line in diff if line.startswith('-') and not line.startswith('---')]

    print(f"Added ({len(added)}):")
    for line in added:
        print(f" + {line}")

    print(f"\nRemoved ({len(removed)}):")
    for line in removed:
        print(f" - {line}")

    if not added and not removed:
        print("No changes detected.")


compare_configs('config_1.txt', 'config_2.txt')

Output:

Added (2):
 + port = 9090
 + log_level = DEBUG

Removed (2):
 - port = 8080
 - log_level = INFO
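
For key = value files like the samples, it can be more useful to report which settings changed than which lines changed. The following is a minimal sketch, assuming one setting per line and # for comments; parse_config and diff_settings are hypothetical helper names, not part of any library:

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        settings[key.strip()] = value.strip()
    return settings

def diff_settings(old, new):
    """Return (changed, added, removed) settings between two parsed configs."""
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    return changed, added, removed

old_text = "# Sample configuration\nhost = 127.0.0.1\nport = 8080\nlog_level = INFO\n"
new_text = "# Sample configuration\nhost = 127.0.0.1\nport = 9090\nlog_level = DEBUG\n"

changed, added, removed = diff_settings(parse_config(old_text), parse_config(new_text))
for key in sorted(changed):
    before, after = changed[key]
    print(f"{key}: {before} -> {after}")
```

Unlike a line-based diff, this comparison ignores reordered lines and whitespace around the equals sign, which may or may not be what you want for a given file format.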

Quick Comparison of Methods

| Method | Shows Differences | Memory Efficient | Best For |
| --- | --- | --- | --- |
| filecmp.cmp() | ❌ (yes/no only) | ✅ | Quick identical check |
| Hash comparison | ❌ (yes/no only) | ✅ | Large files, integrity verification |
| Streaming line-by-line | ✅ (line numbers) | ✅ | Large files with line-level detail |
| difflib.unified_diff | ✅ (full diff output) | 🔶 (loads into memory) | Professional diff reports |
| difflib.HtmlDiff | ✅ (visual HTML) | 🔶 (loads into memory) | Visual side-by-side comparison |
| readlines() comparison | ✅ (line numbers) | ❌ (loads into memory) | Small files, simple scripts |

Conclusion

Python provides flexible tools for comparing text files at every level of detail:

  • Use filecmp.cmp() for a quick boolean check. Ideal when you only need to know if files are identical.
  • Use hash comparison for large files where you need a fast, memory-efficient equality check.
  • Use line-by-line streaming for memory-efficient comparison that shows exactly which lines differ.
  • Use difflib.unified_diff() for professional, human-readable diff output similar to git diff.
  • Use difflib.HtmlDiff() to generate visual, color-coded comparison reports for documentation or review.

Choose the method that matches your needs: a simple same/different answer, a list of changed lines, or a full visual diff report.