Why Doesn't Unicodedata Recognise Certain Characters?
Solution 1:
The unicodedata.name() lookup relies on the second column (field 1, counting from zero) of the UnicodeData.txt database in the standard (Python 2.7 uses Unicode 5.2.0). If that name starts with <, it is ignored. All control codes, including newlines, are in that category; their name field holds nothing other than <control>:
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
Field 10 (counting from zero, so the eleventh column) is the old Unicode 1.0 name, and should not be used, according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
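As a rough illustration of the point above, the sketch below splits the sample record on semicolons and then asks unicodedata.name() for the name of \n. It assumes Python 3; the variable names are only for illustration.

```python
import unicodedata

# The UnicodeData.txt record for U+000A, copied from the sample above.
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"
fields = record.split(";")

name_field = fields[1]       # "<control>": a generic placeholder, not a unique name
old_name_field = fields[10]  # "LINE FEED (LF)": the old Unicode 1.0 name

# name() raises ValueError when a character has no name...
try:
    unicodedata.name("\n")
    has_name = True
except ValueError:
    has_name = False

# ...unless a default is passed as the second argument.
fallback = unicodedata.name("\n", "<no name>")

print(name_field, old_name_field, has_name, fallback)
```

Passing a default is usually the simplest way to handle control characters when names are only needed for display.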
Python 3.3 added support for NameAliases.txt, which lets you look up names by alias; so lookup('LINE FEED'), lookup('new line') or lookup('eol'), etc., all reference \n. However, the unicodedata.name() method does not support aliases, nor could it (which would it pick?):
- Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
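A minimal sketch of alias lookup, assuming Python 3.3 or later. The aliases are the ones mentioned in this answer; the variable names are illustrative.

```python
import unicodedata

# Aliases for U+000A mentioned above; lookup() matches names case-insensitively.
aliases = ["LINE FEED", "new line", "eol"]
resolved = [unicodedata.lookup(alias) for alias in aliases]

# The \N{...} string escape resolves aliases as well.
escaped = "\N{LINE FEED}"

# The reverse direction still fails: name() does not consult aliases.
try:
    unicodedata.name(escaped)
    reverse_works = True
except ValueError:
    reverse_works = False

print([ch == "\n" for ch in resolved], escaped == "\n", reverse_works)
```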
TL;DR: LINE FEED is not the official name for \n; it is only an alias for it. Python 3.3 and up let you look up characters by alias.