How to Detect Text File Encoding in Python

When open(file).read() fails with a UnicodeDecodeError, the file's actual encoding does not match what Python assumed. Before you can process the file correctly, you need to identify its real encoding. Python offers several libraries to detect character encodings automatically, each with different trade-offs in speed, accuracy, and API design.

In this guide, you will learn how to detect text file encoding using chardet and charset-normalizer, handle large files efficiently with incremental detection, and build a robust file reader that works regardless of the source encoding.

Understanding the Problem

Python 3 defaults to UTF-8 when opening text files on most systems, but not all files are encoded in UTF-8. Files from Windows applications often use cp1252, legacy databases may produce latin1, and files from Asian systems might use shift_jis or gb2312. Opening a file with the wrong encoding either crashes with a UnicodeDecodeError or produces garbled text:

# This fails if the file is not actually UTF-8
try:
    with open("legacy_data.txt", "r", encoding="utf-8") as f:
        content = f.read()
except UnicodeDecodeError as e:
    print(f"Error: {e}")

Possible output:

Error: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

The solution is to detect the encoding first by analyzing the raw bytes, then open the file with the correct encoding.

Tip

Always open files in binary mode ("rb") when detecting encoding. Text mode requires knowing the encoding beforehand, which is the very problem you are trying to solve.

Using chardet for Universal Detection

The chardet library is the traditional standard for encoding detection in Python. It analyzes byte patterns and returns the most likely encoding along with a confidence score:

pip install chardet

Basic Detection

For small to medium files, read a sample of bytes and pass them to chardet.detect():

import chardet

with open("unknown_file.txt", "rb") as f:
    raw_data = f.read(10000)

result = chardet.detect(raw_data)

print(result)

Example output:

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

The result dictionary contains three fields:

  • encoding: the detected encoding name (usable directly in open())
  • confidence: a score between 0 and 1 indicating how certain the detection is
  • language: the detected language, if applicable
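Put together, detection and reading become a two-step pattern: sample the raw bytes, detect, then reopen in text mode. A minimal sketch of that pattern (the helper name and the utf-8 fallback for a None result are choices made here, not part of the chardet API):

```python
import chardet

def open_with_detection(filepath: str) -> str:
    """Detect the encoding from a byte sample, then reopen in text mode."""
    with open(filepath, "rb") as f:
        result = chardet.detect(f.read(10000))

    # chardet returns None for the encoding when it has no guess at all;
    # fall back to utf-8 in that case
    encoding = result["encoding"] or "utf-8"
    with open(filepath, "r", encoding=encoding) as f:
        return f.read()
```

The string chardet places in `result["encoding"]` can be passed straight to `open()`, which is what makes this two-step pattern work.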

Incremental Detection for Large Files

For large files, loading 10,000 bytes may not provide enough data for accurate detection, but reading the entire file wastes memory. The UniversalDetector class solves this by analyzing content incrementally, stopping as soon as confidence reaches an acceptable threshold:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

with open("massive_log.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break

detector.close()
print(detector.result)

Example output:

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

The detector processes only as many lines as needed, making it suitable for multi-gigabyte files without loading them entirely into memory.
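Iterating by line assumes the file actually contains newlines; a binary blob without line breaks would be consumed in a single giant read. The same incremental idea works with fixed-size chunks — a sketch, where the 64 KB chunk size and the function name are arbitrary choices:

```python
from chardet.universaldetector import UniversalDetector

def detect_encoding_chunked(filepath: str, chunk_size: int = 64 * 1024) -> dict:
    """Feed fixed-size chunks to UniversalDetector until it is confident."""
    detector = UniversalDetector()
    with open(filepath, "rb") as f:
        while chunk := f.read(chunk_size):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()  # finalizes the result even if done was never reached
    return detector.result
```

Calling `detector.close()` is required in both variants: it finalizes `detector.result` even when the detector never reached the `done` state.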

Using charset-normalizer for Better Performance

The charset-normalizer library is a modern alternative that is faster, designed specifically for Python 3, and serves as the detection engine used internally by the popular requests library:

pip install charset-normalizer

from charset_normalizer import from_path

results = from_path("data.txt")
best = results.best()

if best:
    print(f"Encoding: {best.encoding}")
    # Access the decoded content directly
    content = str(best)
    print(f"Preview: {content[:100]}")
else:
    print("Could not detect encoding")

Example output:

Encoding: utf-8
Preview: This is the first line of the file containing various characters...

The from_path() function analyzes the file and returns a ranked list of possible encodings. The best() method returns the most likely match.
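charset-normalizer can also analyze bytes that are already in memory — a downloaded payload, for instance — via its `from_bytes()` function. A sketch (the sample text is arbitrary, and the exact codec name returned can vary, so it is only printed here):

```python
from charset_normalizer import from_bytes

# Bytes that were encoded with a Windows codepage, as a stand-in for
# data received over the network
raw = "naïve café résumé\n".encode("cp1252") * 50

best = from_bytes(raw).best()
if best:
    print(best.encoding)   # a Python codec name
    print(str(best)[:20])  # the decoded text
```

As with `from_path()`, `str(best)` gives you the decoded content directly, so you never have to decode the bytes a second time yourself.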

Building a Robust File Reader

Combining detection with fallback strategies creates a reliable reader that handles files from virtually any source:

import chardet

def read_file_auto(filepath, sample_size=10000):
    """Read a text file with automatic encoding detection."""

    # Try UTF-8 first (covers most modern files)
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        pass

    # Try UTF-8 with BOM (common in Excel exports)
    try:
        with open(filepath, "r", encoding="utf-8-sig") as f:
            return f.read()
    except UnicodeDecodeError:
        pass

    # Detect encoding from file sample
    with open(filepath, "rb") as f:
        result = chardet.detect(f.read(sample_size))

    detected = result["encoding"]
    confidence = result["confidence"]

    if confidence < 0.7:
        print(f"Warning: Low confidence detection - {detected} ({confidence:.0%})")

    try:
        with open(filepath, "r", encoding=detected) as f:
            return f.read()
    except (UnicodeDecodeError, TypeError):
        # Last resort: latin1 maps all 256 byte values and never fails
        print("Falling back to latin1 encoding")
        with open(filepath, "r", encoding="latin1") as f:
            return f.read()

# Usage
content = read_file_auto("mystery_file.txt")
print(f"Read {len(content)} characters")

The function tries encodings in order of likelihood. UTF-8 is attempted first since it covers the vast majority of modern files. If that fails, chardet analyzes the raw bytes. The latin1 fallback guarantees the file will load because it maps all 256 possible byte values to characters, though the decoded text may not be perfectly accurate.
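When a lossy read is acceptable — log scraping, keyword searching — the `errors` parameter of `open()` is a simpler alternative to detection altogether: `errors="replace"` substitutes U+FFFD for each undecodable byte, while `errors="surrogateescape"` preserves the raw bytes so they survive a round trip back to disk. A small stdlib-only sketch (the filename and sample bytes are made up for illustration):

```python
# Write a file containing a byte that is invalid in UTF-8
with open("dirty.txt", "wb") as f:
    f.write(b"valid text \xff more text\n")

# errors="replace" swaps each undecodable byte for U+FFFD
with open("dirty.txt", "r", encoding="utf-8", errors="replace") as f:
    print(f.read())  # valid text � more text

# errors="surrogateescape" keeps the raw byte recoverable on re-encode
with open("dirty.txt", "r", encoding="utf-8", errors="surrogateescape") as f:
    text = f.read()
assert text.encode("utf-8", errors="surrogateescape") == b"valid text \xff more text\n"
```

The trade-off: `errors="replace"` destroys information, and `surrogateescape` text can raise errors if you later encode it with strict handling, so detection remains the better choice when fidelity matters.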

Detecting Encoding for Multiple Files

When processing a batch of files from various sources, you can scan an entire directory and report the encodings:

import chardet
from pathlib import Path

def detect_encodings(folder: str, pattern: str = "*.txt") -> dict:
    """Detect encoding for all matching files in a folder."""
    results = {}

    for filepath in sorted(Path(folder).glob(pattern)):
        with open(filepath, "rb") as f:
            detection = chardet.detect(f.read(10000))

        results[filepath.name] = {
            "encoding": detection["encoding"],
            "confidence": detection["confidence"],
        }

    return results

report = detect_encodings("./data_files")
for filename, info in report.items():
    status = "OK" if info["confidence"] > 0.7 else "LOW CONFIDENCE"
    print(f"{filename}: {info['encoding']} ({info['confidence']:.0%}) {status}")

Example output:

config.txt: ascii (100%) OK
legacy_export.txt: Windows-1252 (73%) OK
notes.txt: utf-8 (99%) OK
old_data.txt: ISO-8859-1 (55%) LOW CONFIDENCE
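A common follow-up to scanning a directory is normalizing every file to UTF-8. A sketch of a per-file converter using the same detection approach (the function name and the `_utf8.txt` output naming are choices made here):

```python
import chardet
from pathlib import Path

def convert_to_utf8(src: Path) -> Path:
    """Decode a file using its detected encoding and rewrite it as UTF-8."""
    raw = src.read_bytes()
    detection = chardet.detect(raw[:10000])
    encoding = detection["encoding"] or "latin1"  # latin1 never fails

    text = raw.decode(encoding, errors="replace")
    dest = src.with_name(src.stem + "_utf8.txt")
    dest.write_text(text, encoding="utf-8")
    return dest
```

Writing to a new file rather than overwriting the original keeps the raw bytes available in case a low-confidence detection turns out to be wrong.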

Comparison of Detection Libraries

| Feature               | chardet                   | charset-normalizer             |
|-----------------------|---------------------------|--------------------------------|
| Speed                 | Slower                    | Faster                         |
| Accuracy              | High                      | High                           |
| Incremental detection | Yes (UniversalDetector)   | No                             |
| Python 3 optimized    | No (ported from Python 2) | Yes                            |
| Used by requests      | No (replaced in v2.28+)   | Yes                            |
| Install               | pip install chardet       | pip install charset-normalizer |
Warning

Encoding detection is never 100% accurate. It works by analyzing statistical patterns in byte sequences, which means short files or files with mostly ASCII content provide less information for the detector. The confidence score indicates reliability. Values below 0.7 suggest the detected encoding might be incorrect, and you should consider implementing fallback strategies for those cases.
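The effect of sample size is easy to see: the same detector grows more certain as it sees more data. A small illustration (the exact encoding names and scores vary between chardet versions, so none are asserted here):

```python
import chardet

short_sample = "é".encode("cp1252")                       # a single non-ASCII byte
long_sample = ("café crème brûlée\n" * 100).encode("cp1252")

for label, data in [("1 byte", short_sample), ("long text", long_sample)]:
    result = chardet.detect(data)
    print(f"{label}: {result['encoding']} at {result['confidence']:.0%}")
```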

Common Encodings Reference

| Encoding            | Typical Source                           |
|---------------------|------------------------------------------|
| utf-8               | Web APIs, Linux, modern applications     |
| utf-8-sig           | Excel exports with Byte Order Mark       |
| cp1252              | Windows legacy software (Western Europe) |
| latin1 / iso-8859-1 | Older databases, legacy systems          |
| ascii               | Plain English text, configuration files  |
| shift_jis           | Japanese Windows applications            |
| gb2312 / gbk        | Chinese text files                       |
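This table matters because the same bytes decode differently under different entries. A quick stdlib-only demonstration with the byte 0xE9, the one from the error message at the start of this guide:

```python
data = b"caf\xe9"  # 'café' encoded as latin1 or cp1252

print(data.decode("latin1"))  # café  (0xE9 maps to é in latin1)
print(data.decode("cp1252"))  # café  (same mapping in cp1252)

try:
    data.decode("utf-8")      # in UTF-8, 0xE9 starts a multi-byte sequence
except UnicodeDecodeError as e:
    print(f"utf-8 failed: {e}")
```

This also shows why detection can never be fully certain: latin1 and cp1252 produce identical text for these bytes, so no detector can tell them apart from the data alone.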

Conclusion

For most modern projects, charset-normalizer provides the best balance of speed and accuracy with a simple API.

  • Use chardet when you need incremental detection for very large files via the UniversalDetector class.
  • Always try UTF-8 first since it covers the majority of files you will encounter, and keep latin1 as a last-resort fallback that is guaranteed to read any file without errors.

Combine detection with explicit fallback strategies in a wrapper function to build file readers that handle encoding variations automatically and gracefully.