Why Doesn't Unicodedata Recognise Certain Characters?

August 28, 2023

In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.

>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last

Solution 1:

The unicodedata.name() lookup relies on field 1 of the UnicodeData.txt database in the Unicode standard, that is, the second column, counting the code point as field 0 (Python 2.7 uses Unicode 5.2.0).

If that name starts with <, it is ignored. All control codes, including newlines, fall into that category; their name field holds nothing but the placeholder <control>:

000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

Field 10 (counting the code point as field 0) is the old Unicode 1.0 name and, according to the standard, should not be used. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
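To make the field layout concrete, here is a minimal sketch that splits the record quoted above on semicolons. The helper names parse_record and has_usable_name are hypothetical, and the "<" check only mirrors the rule described in this answer; it is not Python's actual database-building code.

```python
# Illustrative only: helper names are made up for this sketch.
RECORD = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"


def parse_record(line):
    """Split one UnicodeData.txt record into the fields discussed here."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),  # field 0: code point in hex
        "name": fields[1],                 # field 1: the character name
        "unicode_1_name": fields[10],      # field 10: old Unicode 1.0 name
    }


def has_usable_name(name_field):
    """Placeholder names such as <control> start with '<' and are skipped."""
    return not name_field.startswith("<")


entry = parse_record(RECORD)
print(entry["name"], "|", entry["unicode_1_name"])
print(has_usable_name(entry["name"]))  # False
```

The only name-like text for U+000A sits in field 10, which is exactly the field the standard says not to rely on.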

Python 3.3 added support for NameAliases.txt, which lets you look up characters by alias, so lookup('LINE FEED'), lookup('new line'), lookup('eol') and so on all return \n. However, unicodedata.name() does not support aliases, nor could it (which one would it pick?). From the Python 3.3 release notes:

Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
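A short sketch of that asymmetry, assuming Python 3.3 or later: several aliases resolve to the same character through lookup() and \N{...}, while name() still finds no official name for it. The variable names are only for illustration.

```python
import unicodedata

# Aliases for U+000A listed in NameAliases.txt.
aliases = ("LINE FEED", "NEW LINE", "EOL")
resolved = {alias: unicodedata.lookup(alias) for alias in aliases}

for alias, ch in resolved.items():
    print(alias, "->", repr(ch))

# The \N{...} escape resolves aliases as well.
escaped = "\N{LINE FEED}"

# The reverse direction still fails: there is no official name to return.
try:
    unicodedata.name("\n")
    name_failed = False
except ValueError:
    name_failed = True

print(name_failed)
```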

TL;DR: LINE FEED is not the official name for \n, it is but an alias for it. Python 3.3 and up let you look up characters by alias.
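If you just need a printable label and do not want the ValueError, one possible workaround is the optional default argument of unicodedata.name(). The sketch below is written for Python 3; the describe helper and its placeholder text are hypothetical, not part of the library.

```python
import unicodedata


def describe(ch, placeholder="<unnamed>"):
    """Return the official name of ch, or a placeholder plus its category."""
    # name() returns the second argument instead of raising ValueError.
    official = unicodedata.name(ch, "")
    if official:
        return official
    return "{} (category {}, U+{:04X})".format(
        placeholder, unicodedata.category(ch), ord(ch)
    )


print(describe("A"))
print(describe("\n"))
```

Falling back to the general category (Cc for control codes) at least tells you why no name is available.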
