How to Remove Accents from Strings in Python
Removing accents (diacritics) from strings is a common task in text processing, especially for data cleaning and normalization.
This guide explores various methods for removing accents from strings in Python, including using the unidecode package and the built-in unicodedata module.
Removing Accents with the unidecode Package
The unidecode package transliterates Unicode characters into their closest ASCII equivalents. Start by installing the Unidecode package with pip:
pip install Unidecode
# or
pip3 install Unidecode
Then you can use the unidecode function:
from unidecode import unidecode
str_with_accents = 'ÂéüÒÑ'
str_without_accents = unidecode(str_with_accents)
print(str_without_accents) # Output: AeuON
- The unidecode() function replaces Unicode characters with their closest ASCII equivalents.
Handling Non-Translatable Characters
By default (errors='ignore'), the library drops any character that it cannot translate to ASCII.
from unidecode import unidecode
str_with_accents = 'ÂéüÒÑ\ue123' # Non-translatable character is present
str_without_accents = unidecode(str_with_accents)
print(str_without_accents) # Output: AeuON
Specifying Error Handling
To raise an error when unidecode encounters a character it cannot translate, set the errors parameter to 'strict':
from unidecode import unidecode, UnidecodeError
str_with_accents = 'ÂéüÒÑ\ue123'
try:
    str_without_accents = unidecode(str_with_accents, errors='strict')
except UnidecodeError as e:
    # e.index is the position of the character that could not be translated
    print(e.index)  # Output: 5
Preserving Non-Translatable Characters
To preserve unknown characters, set the errors parameter to 'preserve'. The returned string is then no longer guaranteed to be ASCII-only, so this only works if the result is not required to be ASCII compatible.
from unidecode import unidecode
str_with_accents = 'ÂéüÒÑ\ue123'
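str_without_accents = unidecode(
    str_with_accents,
    errors='preserve',
)
print(str_without_accents) # Output: AeuON followed by the original '\ue123' character
With errors='preserve', unsupported characters are kept in the output unchanged. To substitute a placeholder instead, set errors to 'replace', which inserts the replace_str argument ('?' by default).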
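Because a preserved character is not ASCII, it can help to check a result before handing it to code that expects ASCII. The following is a minimal sketch using only built-in string methods. The `preserved` value is a hard-coded stand-in for a string that still contains the untranslatable character, so the snippet does not need Unidecode to run, and the variable names are illustrative only.

```python
# Hard-coded stand-in for a result that kept its untranslatable character,
# so this snippet runs without Unidecode installed.
preserved = 'AeuON\ue123'

# str.isascii() (Python 3.7+) reports whether every character is ASCII.
is_ascii = preserved.isascii()
print(is_ascii)  # False

# Encoding strictly to ASCII fails at the preserved character.
try:
    preserved.encode('ascii')
    error_position = None
except UnicodeEncodeError as e:
    error_position = e.start
print(error_position)  # 5

# Dropping non-ASCII characters while encoding is one way to force ASCII.
ascii_bytes = preserved.encode('ascii', errors='ignore')
print(ascii_bytes)  # b'AeuON'
```

Whether to reject, drop, or keep such characters depends on what the downstream code can tolerate.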
Removing Accents with unicodedata
The built-in unicodedata module is an alternative for removing accents, and doesn't require any external libraries to be installed.
import unicodedata
def remove_accents(string):
    return ''.join(
        char for char in unicodedata.normalize('NFD', string)
        if unicodedata.category(char) != 'Mn'
    )
str_with_accents = 'ÂéüÒÑ'
print(remove_accents(str_with_accents)) # Output: AeuON
print(remove_accents('Noël, Adrián, Sørina, Zoë, Renée')) # Output: Noel, Adrian, Sørina, Zoe, Renee
- unicodedata.normalize('NFD', string) decomposes each Unicode character into its base character and any combining diacritical marks.
- The generator expression then filters out characters whose category is 'Mn', which stands for Nonspacing_Mark.
- Together, normalizing and then filtering convert characters with diacritics to their base equivalents and remove the accents. Characters that have no canonical decomposition, such as ø, are left unchanged, which is why Sørina keeps its ø in the output above.
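To see the two steps at work, the sketch below inspects the NFD form of an accented letter and of ø, then shows one possible way to handle letters that NFD cannot split. The `describe` helper, the `FALLBACK` table, and `remove_accents_with_fallback` are hypothetical names introduced here for illustration. The table is hand-written and far from exhaustive, so treat it as a starting point under those assumptions rather than a complete solution.

```python
import unicodedata

def describe(text):
    """List (category, name) for each code point of the NFD form of text."""
    return [
        (unicodedata.category(char), unicodedata.name(char))
        for char in unicodedata.normalize('NFD', text)
    ]

# '\u00e9' is é: NFD splits it into a base letter plus a combining mark.
print(describe('\u00e9'))
# [('Ll', 'LATIN SMALL LETTER E'), ('Mn', 'COMBINING ACUTE ACCENT')]

# '\u00f8' is ø: it has no decomposition, so NFD leaves it as one letter.
print(describe('\u00f8'))
# [('Ll', 'LATIN SMALL LETTER O WITH STROKE')]

# Hypothetical fallback table for letters that NFD cannot split (ø, Ø, ł, Ł).
FALLBACK = str.maketrans({
    '\u00f8': 'o', '\u00d8': 'O',
    '\u0142': 'l', '\u0141': 'L',
})

def remove_accents_with_fallback(text):
    """Strip combining marks, then apply the hand-written FALLBACK table."""
    stripped = ''.join(
        char for char in unicodedata.normalize('NFD', text)
        if unicodedata.category(char) != 'Mn'
    )
    return stripped.translate(FALLBACK)

print(remove_accents_with_fallback('S\u00f8rina, Zo\u00eb'))  # Sorina, Zoe
```

If many such letters need handling, the unidecode package covered earlier may be a better fit than maintaining a table by hand.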