How to Resolve a UnicodeDecodeError for a CSV File in Python

The UnicodeDecodeError is one of the most common errors encountered when reading CSV files in Python. It occurs when Python tries to decode the file's bytes using the wrong encoding scheme, typically by defaulting to UTF-8 when the file was actually saved in a different encoding such as Latin-1, UTF-16, or Windows-1252.

In this guide, you'll learn why this error happens, how to identify the correct encoding, and multiple methods to resolve it when working with CSV files in Pandas and Python.

Understanding the Error

Every text file is stored as a sequence of bytes on disk. An encoding scheme (like UTF-8, ASCII, or Latin-1) defines how those bytes map to characters. When Python reads a file, it must use the same encoding the file was saved with. If there's a mismatch, certain byte sequences can't be decoded, and Python raises a UnicodeDecodeError.

Simple Example

# ASCII can only represent characters 0–127
text = b"a".decode("ascii")  # ✅ Works: 'a' is within ASCII range
print(text)

# Byte 0xf1 (ñ in Latin-1) is outside the ASCII range
try:
    text = b"a\xf1".decode("ascii")
except UnicodeDecodeError as e:
    print(f"Error: {e}")

Output:

a
Error: 'ascii' codec can't decode byte 0xf1 in position 1: ordinal not in range(128)

The fix is to use an encoding that supports the byte 0xf1:

# ✅ Latin-1 (ISO-8859-1) supports bytes 0–255
text = b"a\xf1".decode("latin-1")
print(text) # Output: añ

The Error When Reading CSV Files

When using pd.read_csv(), Pandas defaults to UTF-8 encoding. If the CSV file was saved in a different encoding, you'll see an error like this:

import pandas as pd

# ❌ Fails if the file isn't UTF-8 encoded
try:
    df = pd.read_csv('data.csv')
except UnicodeDecodeError as e:
    print(f"Error: {e}")

Output:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
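The byte named in the message is itself a clue: 0xff at position 0 is usually the start of a UTF-16 byte order mark (BOM). A quick first check is to sniff the file's first bytes. The sketch below writes a small UTF-16 sample file first so it is runnable end to end; with a real file, skip that step and sniff the file directly:

```python
# Self-contained demo: write a small UTF-16 sample (little-endian, with BOM)
with open('data.csv', 'wb') as f:
    f.write(b'\xff\xfe' + 'name,city\n'.encode('utf-16-le'))

# Sniff the first few bytes for a BOM (byte order mark)
with open('data.csv', 'rb') as f:
    head = f.read(4)

if head.startswith(b'\xef\xbb\xbf'):
    print("UTF-8 BOM found -- try encoding='utf-8-sig'")
elif head.startswith(b'\xff\xfe') or head.startswith(b'\xfe\xff'):
    print("UTF-16 BOM found -- try encoding='utf-16'")
else:
    print(f"No BOM; first bytes: {head!r}")
```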

Solution 1: Specify the Correct Encoding

If you know the file's encoding, pass it directly to pd.read_csv():

import pandas as pd

# ✅ Specify the correct encoding
df = pd.read_csv('data.csv', encoding='utf-16')
print(df.head())

Common Encodings

Encoding                   When to Use
utf-8                      Default for most modern files, web data, Linux/Mac systems
latin-1 (or iso-8859-1)    Western European languages, older systems
utf-16                     Files from Windows apps, Excel exports
cp1252 (or windows-1252)   Windows-generated files with special characters
ascii                      Plain English text with no special characters
utf-8-sig                  UTF-8 files with a BOM (Byte Order Mark), common from Excel

import pandas as pd

# Examples with different encodings
df = pd.read_csv('european_data.csv', encoding='latin-1')
df = pd.read_csv('windows_export.csv', encoding='cp1252')
df = pd.read_csv('excel_export.csv', encoding='utf-8-sig')

Solution 2: Detect the Encoding Automatically

When you don't know the file's encoding, use the chardet library to detect it:

pip install chardet

import chardet
import pandas as pd

# Detect the encoding
with open('data.csv', 'rb') as f:
    raw_data = f.read()

result = chardet.detect(raw_data)

print(f"Detected encoding: {result['encoding']}")
print(f"Confidence: {result['confidence']:.0%}")

# Use the detected encoding
df = pd.read_csv('data.csv', encoding=result['encoding'])
print(df.head())

Output:

Detected encoding: UTF-16
Confidence: 100%

For large files, reading only a portion is more efficient:

import chardet

with open('large_file.csv', 'rb') as f:
    # Read only the first 100KB for detection
    raw_data = f.read(100_000)

result = chardet.detect(raw_data)

print(f"Detected: {result['encoding']} ({result['confidence']:.0%} confidence)")
Tip: An alternative to chardet is the charset-normalizer library (used internally by the requests library), which is often faster and more accurate:

pip install charset-normalizer

from charset_normalizer import from_path

result = from_path('data.csv')
print(f"Detected: {result.best().encoding}")

Solution 3: Use latin-1 as a Fallback

Latin-1 (ISO-8859-1) can decode any byte value (0–255) without raising an error, making it a reliable fallback when you can't determine the correct encoding:

import pandas as pd

# ✅ latin-1 never raises UnicodeDecodeError
df = pd.read_csv('data.csv', encoding='latin-1')
print(df.head())

Caution: While latin-1 will never raise a UnicodeDecodeError, it may misinterpret characters if the file's actual encoding is something else (like UTF-16 or Shift-JIS). Special characters may appear garbled. Use this as a temporary workaround while you identify the correct encoding.
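The misinterpretation is easy to reproduce in a few lines: UTF-8 bytes decoded as latin-1 come back garbled rather than raising an error:

```python
# 'año' encoded as UTF-8: 'ñ' becomes the two bytes 0xc3 0xb1
raw = 'año'.encode('utf-8')

# latin-1 maps each single byte to a character, so the two-byte
# UTF-8 sequence for 'ñ' turns into two separate characters
print(raw.decode('latin-1'))  # aÃ±o -- no exception, but garbled
```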

Solution 4: Use errors Parameter to Handle Bad Bytes

Python's built-in open() function supports an errors parameter that controls how decoding errors are handled:

import pandas as pd

# Option 1: Ignore problematic bytes
with open('data.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df = pd.read_csv(f)

print("With errors='ignore':")
print(df.head())

# Option 2: Replace problematic bytes with '�'
with open('data.csv', 'r', encoding='utf-8', errors='replace') as f:
    df = pd.read_csv(f)

print("\nWith errors='replace':")
print(df.head())

errors Value   Behavior
'strict'       Raises UnicodeDecodeError (default)
'ignore'       Silently skips undecodable bytes
'replace'      Replaces bad bytes with � (U+FFFD)
Caution: Both 'ignore' and 'replace' can cause data loss or corruption. Characters may be silently dropped or replaced. Use these options only when you're confident the problematic bytes aren't critical to your analysis.
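At the byte level, the difference between the two options looks like this (0xe9 is 'é' in Latin-1 but invalid where it appears in a UTF-8 stream):

```python
raw = b'caf\xe9'  # 'café' saved in latin-1

print(raw.decode('utf-8', errors='ignore'))   # caf -- the bad byte is dropped
print(raw.decode('utf-8', errors='replace'))  # caf� -- U+FFFD is substituted
```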

Solution 5: Convert the File Encoding

Instead of changing your code, you can convert the file itself to UTF-8 before reading it:

Using Python

import shutil

input_file = 'data_utf16.csv'
output_file = 'data_utf8.csv'

# Convert from UTF-16 to UTF-8
with open(input_file, 'r', encoding='utf-16') as source, \
     open(output_file, 'w', encoding='utf-8') as target:
    shutil.copyfileobj(source, target)

print(f"Converted '{input_file}' to UTF-8 as '{output_file}'")
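Using the Command Line

On macOS and Linux, the iconv tool performs the same conversion without any Python. The sample file created in the first line is hypothetical, included only so the commands run end to end:

```shell
# Create a small UTF-16 sample, then convert it to UTF-8 with iconv
printf 'name,city\nJosé,Málaga\n' | iconv -f UTF-8 -t UTF-16 > data_utf16.csv
iconv -f UTF-16 -t UTF-8 data_utf16.csv > data_utf8.csv
cat data_utf8.csv
```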

Using a Text Editor

You can also convert the encoding using a text editor:

  1. Open the CSV file in Notepad, Notepad++, or VS Code.
  2. Go to File → Save As.
  3. Change the Encoding dropdown to UTF-8.
  4. Save the file.

After conversion, the file will work with pd.read_csv() without specifying an encoding.

Reusable Function for Safe CSV Reading

Here's a utility function that tries multiple encodings automatically:

import pandas as pd

def read_csv_safe(filepath, encodings=None, **kwargs):
    """
    Attempt to read a CSV file by trying multiple encodings.

    Args:
        filepath: Path to the CSV file.
        encodings: List of encodings to try. Defaults to common encodings.
        **kwargs: Additional arguments passed to pd.read_csv().

    Returns:
        A Pandas DataFrame.
    """
    if encodings is None:
        # latin-1 must go last: it accepts any byte value, so any
        # encoding listed after it would never be tried
        encodings = ['utf-8', 'utf-8-sig', 'utf-16', 'cp1252', 'latin-1']

    for encoding in encodings:
        try:
            df = pd.read_csv(filepath, encoding=encoding, **kwargs)
            print(f"Successfully read with encoding: {encoding}")
            return df
        except (UnicodeDecodeError, UnicodeError):
            continue

    raise ValueError(f"Could not read '{filepath}' with any of the tried encodings: {encodings}")


# Usage
df = read_csv_safe('mystery_file.csv')
print(df.head())

Quick Troubleshooting Guide

Error Message                           Likely Cause                       Fix
can't decode byte 0xff in position 0    File is UTF-16 encoded             encoding='utf-16'
can't decode byte 0xe9 in position X    File uses Latin-1 or cp1252        encoding='latin-1'
can't decode byte 0xef in position 0    File has a BOM (Byte Order Mark)   encoding='utf-8-sig'
ordinal not in range(128)               File has non-ASCII characters      encoding='utf-8' or encoding='latin-1'

Summary

The UnicodeDecodeError occurs when Python tries to read a CSV file with the wrong encoding. To resolve it:

  1. Specify the correct encoding: pd.read_csv('file.csv', encoding='latin-1') is the best solution when you know the encoding.
  2. Detect the encoding: Use the chardet library to automatically identify the file's encoding.
  3. Use latin-1 as a fallback: It accepts all byte values and never raises an error, though characters may be misinterpreted.
  4. Handle errors gracefully: Use errors='ignore' or errors='replace' to skip or substitute problematic bytes.
  5. Convert the file: Re-save the file as UTF-8 using Python or a text editor.

The most robust approach is to detect the encoding first with chardet, then read the file with the correct encoding explicitly set.