
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when printing in UTF-8 locale

I am cleaning the monolingual corpus of Europarl for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data in .gz file (I

Solution 1:

The UnicodeEncodeError occurs because, when printing, Python encodes strings to bytes before writing them to stdout. In this case the encoding being used, ASCII, cannot represent the character 'é' ('\xe9'), so the error is raised.
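The failure can be reproduced without involving the terminal at all. The snippet below is a minimal sketch: the sample string is made up for illustration, and the explicit encode call stands in for what print does implicitly when stdout is configured as ASCII.

```python
# 'd\xe9clare' is the word "déclare"; '\xe9' is the character 'é' (U+00E9).
text = 'd\xe9clare'

# Encoding to ASCII fails, because ASCII only covers code points 0-127.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)

# UTF-8 represents the same character as the two bytes 0xC3 0xA9.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```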

Setting the PYTHONIOENCODING environment variable overrides the encoding that Python uses for stdin, stdout and stderr with the value of the variable. The UTF-8 encoding can encode any character, so calling the program like this solves the issue:

PYTHONIOENCODING=UTF-8 python3 europarl_extractor.py

assuming the code is something like this:

import gzip

if __name__ == '__main__':
    with gzip.open('europarl-v7.fr.gz', 'rb') as f_in:
        bs = f_in.read()
        txt = bs.decode('utf-8')
        print(txt[:100])

The environment variable may be set in other ways - via an export statement, in .bashrc, .profile etc.
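If changing the environment is not an option, the output encoding can also be set from inside the script. This is a sketch of an alternative to PYTHONIOENCODING, not something taken from the question: sys.stdout.reconfigure needs Python 3.7 or later, and utf8_text_stream is a hypothetical helper name used here for illustration.

```python
import io
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', write_through=True)


# Python 3.7+: switch the existing stdout to UTF-8 in place, if it supports it.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')

# On older versions, the underlying binary buffer can be wrapped instead:
# sys.stdout = utf8_text_stream(sys.stdout.buffer)
```

Either way, print no longer depends on the locale to choose the output encoding.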

An interesting question is why Python is trying to encode output as ASCII. I had assumed that on *nix systems, Python essentially looked at the $LANG environment variable to determine the encoding to use. But in this case the value of $LANG is fr_FR.UTF-8, and yet Python is using ASCII as the output encoding.

From looking at the source for the locale module, and this FAQ, these environment variables are checked, in order:

'LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE'

So it may be that one of LC_ALL or LC_CTYPE has been set to a value that mandates ASCII encoding in your environment (you can check by running the locale command in your terminal; also running locale charmap will tell you the encoding itself).
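The same check can be done from within Python. The sketch below only mirrors the lookup order quoted above; first_locale_setting is a hypothetical helper, not the locale module's actual implementation, and it treats an empty value as unset.

```python
import locale
import os
import sys

# Lookup order as quoted above.
LOCALE_VARS = ('LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE')


def first_locale_setting(environ):
    """Return (name, value) of the first non-empty locale variable, or None."""
    for name in LOCALE_VARS:
        value = environ.get(name)
        if value:
            return name, value
    return None


print('first locale variable set:', first_locale_setting(os.environ))
print('preferred encoding:', locale.getpreferredencoding(False))
print('stdout encoding:', getattr(sys.stdout, 'encoding', None))
```

If the first line reports something like LC_ALL or LC_CTYPE set to C or POSIX while LANG is fr_FR.UTF-8, that would be consistent with the ASCII output encoding seen here.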

Solution 2:

Many thanks for all your help! I found a simple workaround. I'm not sure why it works, but I think that maybe the .txt format is supported somehow? If you know the mechanism, it would be extremely helpful to know.

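import gzip
import os

# file_path and out_dir are assumed to be defined earlier in the script.
with gzip.open(file_path, 'rb') as f_in:
    text = f_in.read()  # bytes, not str: nothing is decoded here

with open(os.path.join(out_dir, 'europarl.txt'), 'wb') as f_out:
    f_out.write(text)  # the bytes are written unchanged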

When I print out the text file in terminal, it looks like this:

Reprise de la session Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
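On the mechanism: the .txt extension most likely plays no role. Both files in Solution 2 are opened in binary mode ('rb' and 'wb'), so Python only handles bytes and never encodes or decodes anything, which means the ASCII codec is never invoked. The terminal then presumably interprets those bytes as UTF-8 itself when the file is displayed. A small sketch with made-up sample data and temporary file names:

```python
import gzip
import os
import tempfile

# Made-up sample text, stored as UTF-8 bytes ('\xe9' is 'é').
original = 'Je d\xe9clare reprise la session'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, 'sample.gz')
    txt_path = os.path.join(tmp_dir, 'sample.txt')

    # Create a small gzip file to stand in for the corpus.
    with gzip.open(gz_path, 'wb') as f_out:
        f_out.write(original)

    # Same pattern as Solution 2: read bytes, write bytes.
    with gzip.open(gz_path, 'rb') as f_in:
        data = f_in.read()
    with open(txt_path, 'wb') as f_out:
        f_out.write(data)

    with open(txt_path, 'rb') as f_check:
        copied = f_check.read()

print(type(data).__name__)   # the data is a bytes object, not str
print(copied == original)    # the bytes survive the round trip unchanged
```

The error would come back as soon as the bytes are decoded to str and passed to print under an ASCII stdout.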
