# How to Resolve a UnicodeDecodeError for a CSV File in Python
The `UnicodeDecodeError` is one of the most common errors encountered when reading CSV files in Python. It occurs when Python tries to decode the file's bytes with the wrong encoding, typically because the reader defaults to UTF-8 while the file was actually saved in a different encoding such as Latin-1, UTF-16, or Windows-1252.
In this guide, you'll learn why this error happens, how to identify the correct encoding, and multiple methods to resolve it when working with CSV files in Pandas and Python.
## Understanding the Error
Every text file is stored as a sequence of bytes on disk. An encoding scheme (like UTF-8, ASCII, or Latin-1) defines how those bytes map to characters. When Python reads a file, it must use the same encoding the file was saved with. If there's a mismatch, certain byte sequences can't be decoded, and Python raises a UnicodeDecodeError.
### Simple Example
```python
# ASCII can only represent characters 0–127
text = b"a".decode("ascii")  # ✅ Works: 'a' is within ASCII range
print(text)

# Byte 0xf1 (ñ) is outside ASCII range
try:
    text = b"a\xf1".decode("ascii")
except UnicodeDecodeError as e:
    print(f"Error: {e}")
```
Output:

```
a
Error: 'ascii' codec can't decode byte 0xf1 in position 1: ordinal not in range(128)
```
The fix is to use an encoding that supports the byte 0xf1:
```python
# ✅ Latin-1 (ISO-8859-1) supports bytes 0–255
text = b"a\xf1".decode("latin-1")
print(text)  # Output: añ
```
## The Error When Reading CSV Files
When using pd.read_csv(), Pandas defaults to UTF-8 encoding. If the CSV file was saved in a different encoding, you'll see an error like this:
```python
import pandas as pd

# ❌ Fails if the file isn't UTF-8 encoded
try:
    df = pd.read_csv('data.csv')
except UnicodeDecodeError as e:
    print(f"Error: {e}")
```
Output:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```
## Solution 1: Specify the Correct Encoding
If you know the file's encoding, pass it directly to pd.read_csv():
```python
import pandas as pd

# ✅ Specify the correct encoding
df = pd.read_csv('data.csv', encoding='utf-16')
print(df.head())
```
### Common Encodings

| Encoding | When to Use |
|---|---|
| `utf-8` | Default for most modern files, web data, Linux/Mac systems |
| `latin-1` (or `iso-8859-1`) | Western European languages, older systems |
| `utf-16` | Files from Windows apps, Excel exports |
| `cp1252` (or `windows-1252`) | Windows-generated files with special characters |
| `ascii` | Plain English text with no special characters |
| `utf-8-sig` | UTF-8 files with a BOM (Byte Order Mark), common from Excel |
```python
import pandas as pd

# Examples with different encodings
df = pd.read_csv('european_data.csv', encoding='latin-1')
df = pd.read_csv('windows_export.csv', encoding='cp1252')
df = pd.read_csv('excel_export.csv', encoding='utf-8-sig')
```
## Solution 2: Detect the Encoding Automatically
When you don't know the file's encoding, use the chardet library to detect it:
```shell
pip install chardet
```
```python
import chardet
import pandas as pd

# Detect the encoding
with open('data.csv', 'rb') as f:
    raw_data = f.read()

result = chardet.detect(raw_data)
print(f"Detected encoding: {result['encoding']}")
print(f"Confidence: {result['confidence']:.0%}")

# Use the detected encoding
df = pd.read_csv('data.csv', encoding=result['encoding'])
print(df.head())
```
Output:

```
Detected encoding: UTF-16
Confidence: 100%
```
For large files, reading only a portion is more efficient:
```python
import chardet

with open('large_file.csv', 'rb') as f:
    # Read only the first 100KB for detection
    raw_data = f.read(100_000)

result = chardet.detect(raw_data)
print(f"Detected: {result['encoding']} ({result['confidence']:.0%} confidence)")
```
An alternative to chardet is the charset-normalizer library (used internally by the requests library), which is often faster and more accurate:
```shell
pip install charset-normalizer
```

```python
from charset_normalizer import from_path

result = from_path('data.csv')
print(f"Detected: {result.best().encoding}")
```
## Solution 3: Use latin-1 as a Fallback
Latin-1 (ISO-8859-1) can decode any byte value (0–255) without raising an error, making it a reliable fallback when you can't determine the correct encoding:
```python
import pandas as pd

# ✅ latin-1 never raises UnicodeDecodeError
df = pd.read_csv('data.csv', encoding='latin-1')
print(df.head())
```
While latin-1 will never raise a UnicodeDecodeError, it may misinterpret characters if the file's actual encoding is something else (like UTF-16 or Shift-JIS). Special characters may appear garbled. Use this as a temporary workaround while you identify the correct encoding.
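To see why: every byte value is valid Latin-1, so decoding the wrong bytes succeeds silently and produces mojibake. A quick stdlib-only demonstration:

```python
# UTF-8 bytes for 'café' decode under Latin-1 without error,
# but the accented character comes out garbled (mojibake)
utf8_bytes = "café".encode("utf-8")     # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("latin-1")  # no exception raised
print(garbled)                          # prints 'cafÃ©', not 'café'
```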
## Solution 4: Use the errors Parameter to Handle Bad Bytes

Python's built-in `open()` function supports an `errors` parameter that controls how decoding errors are handled (recent versions of Pandas expose the same behavior through the `encoding_errors` parameter of `pd.read_csv()`):
```python
import pandas as pd

# Option 1: Ignore problematic bytes
with open('data.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df = pd.read_csv(f)
print("With errors='ignore':")
print(df.head())

# Option 2: Replace problematic bytes with '�'
with open('data.csv', 'r', encoding='utf-8', errors='replace') as f:
    df = pd.read_csv(f)
print("\nWith errors='replace':")
print(df.head())
```
| `errors` Value | Behavior |
|---|---|
| `'strict'` | Raises `UnicodeDecodeError` (default) |
| `'ignore'` | Silently skips undecodable bytes |
| `'replace'` | Replaces bad bytes with `�` (U+FFFD) |
Both 'ignore' and 'replace' can cause data loss or corruption. Characters may be silently dropped or replaced. Use these options only when you're confident the problematic bytes aren't critical to your analysis.
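To illustrate the trade-off at the byte level, here is a small sketch using a Latin-1-encoded é (byte 0xe9), which is invalid in UTF-8:

```python
bad_bytes = b"caf\xe9"  # 'café' encoded as Latin-1; 0xe9 is not valid UTF-8 here

print(bad_bytes.decode("utf-8", errors="ignore"))   # 'caf'  (the é is silently dropped)
print(bad_bytes.decode("utf-8", errors="replace"))  # 'caf�' (the é becomes U+FFFD)
```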
## Solution 5: Convert the File Encoding
Instead of changing your code, you can convert the file itself to UTF-8 before reading it:
### Using Python

```python
import shutil

input_file = 'data_utf16.csv'
output_file = 'data_utf8.csv'

# Convert from UTF-16 to UTF-8
with open(input_file, 'r', encoding='utf-16') as source:
    with open(output_file, 'w', encoding='utf-8') as target:
        shutil.copyfileobj(source, target)

print(f"Converted '{input_file}' to UTF-8 as '{output_file}'")
```
### Using a Text Editor
You can also convert the encoding using a text editor:
- Open the CSV file in Notepad, Notepad++, or VS Code.
- Go to File → Save As.
- Change the Encoding dropdown to UTF-8.
- Save the file.
After conversion, the file will work with pd.read_csv() without specifying an encoding.
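As a quick sanity check, the round trip can be verified end to end. The sketch below uses hypothetical file names and writes its own sample UTF-16 file first so it runs standalone:

```python
# Write a sample UTF-16 CSV (hypothetical file name) so this snippet is self-contained
with open('data_utf16.csv', 'w', encoding='utf-16', newline='') as f:
    f.write('name,price\nañejo,9.50\n')

# Convert it to UTF-8
with open('data_utf16.csv', 'r', encoding='utf-16') as src, \
     open('data_utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    dst.write(src.read())

# The converted file now decodes as plain UTF-8, no encoding argument needed
with open('data_utf8.csv', 'r', encoding='utf-8') as f:
    print(f.read())
```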
## Reusable Function for Safe CSV Reading
Here's a utility function that tries multiple encodings automatically:
```python
import pandas as pd

def read_csv_safe(filepath, encodings=None, **kwargs):
    """
    Attempt to read a CSV file by trying multiple encodings.

    Args:
        filepath: Path to the CSV file.
        encodings: List of encodings to try. Defaults to common encodings.
        **kwargs: Additional arguments passed to pd.read_csv().

    Returns:
        A Pandas DataFrame.
    """
    if encodings is None:
        encodings = ['utf-8', 'utf-8-sig', 'latin-1', 'cp1252', 'utf-16', 'ascii']
    for encoding in encodings:
        try:
            df = pd.read_csv(filepath, encoding=encoding, **kwargs)
            print(f"Successfully read with encoding: {encoding}")
            return df
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"Could not read '{filepath}' with any of the tried encodings: {encodings}")

# Usage
df = read_csv_safe('mystery_file.csv')
print(df.head())
```
## Quick Troubleshooting Guide

| Error Message | Likely Cause | Fix |
|---|---|---|
| `can't decode byte 0xff in position 0` | File is UTF-16 encoded | `encoding='utf-16'` |
| `can't decode byte 0xe9 in position X` | File uses Latin-1 or cp1252 | `encoding='latin-1'` |
| `can't decode byte 0xef in position 0` | File has a BOM (Byte Order Mark) | `encoding='utf-8-sig'` |
| `ordinal not in range(128)` | File has non-ASCII characters | `encoding='utf-8'` or `encoding='latin-1'` |
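If the error message alone doesn't settle it, you can also sniff the file's first bytes for a BOM yourself. This sketch writes its own sample UTF-16 file (a hypothetical name) so it runs standalone:

```python
import codecs

# Create a sample UTF-16 file so the check below has something to inspect
with open('sample.csv', 'w', encoding='utf-16') as f:
    f.write('name,city\nAna,Madrid\n')

# Peek at the first bytes without decoding
with open('sample.csv', 'rb') as f:
    head = f.read(4)

if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    print("Likely UTF-16: try encoding='utf-16'")
elif head.startswith(codecs.BOM_UTF8):
    print("UTF-8 with a BOM: try encoding='utf-8-sig'")
else:
    print(f"No BOM found; first bytes: {head!r}")
```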
## Summary
The UnicodeDecodeError occurs when Python tries to read a CSV file with the wrong encoding. To resolve it:
- **Specify the correct encoding**: `pd.read_csv('file.csv', encoding='latin-1')` is the best solution when you know the encoding.
- **Detect the encoding**: Use the `chardet` library to automatically identify the file's encoding.
- **Use `latin-1` as a fallback**: It accepts all byte values and never raises an error, though characters may be misinterpreted.
- **Handle errors gracefully**: Use `errors='ignore'` or `errors='replace'` to skip or substitute problematic bytes.
- **Convert the file**: Re-save the file as UTF-8 using Python or a text editor.
The most robust approach is to detect the encoding first with chardet, then read the file with the correct encoding explicitly set.