List Of Unicode Character Names

August 15, 2022

In Python I can print a Unicode character by name (e.g. print(u'\N{snowman}')). Is there a way I can get a list of all valid names?

Solution 1:

Every codepoint has a name, so you are effectively asking for the Unicode standard list of codepoint names (as well as the list of name aliases, supported by Python 3.3 and up).

Each Python version supports a specific version of the Unicode standard; the unicodedata.unidata_version attribute tells you which one for a given Python runtime. The above links lead to the latest published Unicode version; to match your Python version, replace UCD/latest in the URLs with the value of unicodedata.unidata_version.
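As a rough sketch of that substitution: the links from the original answer did not survive here, so the URL pattern below is an assumption about how the Unicode Character Database is laid out on unicode.org, and the variable names are only illustrative.

```python
import unicodedata

# Unicode version that this interpreter's unicodedata module was built against.
version = unicodedata.unidata_version
print(version)

# Assumed URL layout (not taken from the original answer): the version
# string takes the place of "UCD/latest" in the path.
names_url = f"https://www.unicode.org/Public/{version}/ucd/UnicodeData.txt"
print(names_url)
```

The printed version differs between Python releases, so the resulting URL does too.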

Per codepoint, the unicodedata.name() function can tell you the official name, and unicodedata.lookup() gives you the inverse (name to codepoint).
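A short illustration of the two functions, using the snowman from the question. The default value passed to name() in the last call is just a placeholder chosen for this example.

```python
import unicodedata

# Name to character.
snowman = unicodedata.lookup("SNOWMAN")
print(snowman, hex(ord(snowman)))

# Character to official name.
print(unicodedata.name(snowman))

# name() raises ValueError for characters without a name unless a
# default is supplied as the second argument.
print(unicodedata.name("\n", "<no name>"))
```

Note that lookup() raises KeyError for an unknown name, while name() raises ValueError for an unnamed character.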

Solution 2:

If you want a list of all Unicode character names, consider downloading the Unicode Character Database.

It is included in the base repositories of many Linux distributions (e.g. "unicode-ucd" on RHEL).

The package includes NamesList.txt, which contains the exhaustive list of Unicode character names.

Caution: NamesList.txt takes some time to download (it is larger than 1.5 MB).

Example:

21FE    RIGHTWARDS OPEN-HEADED ARROW
21FF    LEFT RIGHT OPEN-HEADED ARROW
@@  2200    Mathematical Operators  22FF
@@+
@       Miscellaneous mathematical symbols
2200    FOR ALL
    = universal quantifier
2201    COMPLEMENT
    x (latin letter stretched c - 0297)
2202    PARTIAL DIFFERENTIAL
2203    THERE EXISTS
    = existential quantifier
2204    THERE DOES NOT EXIST
    : 2203 0338
2205    EMPTY SET
    = null set
    * used in linguistics to indicate a null morpheme or phonological "zero"
    x (latin capital letter o with stroke - 00D8)
    x (diameter sign - 2300)
    ~ 2205 FE00 zero with long diagonal stroke overlay form
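If you only need the names out of a file shaped like the excerpt above, a minimal sketch could look like this. It assumes that a character entry starts in the first column with 4 to 6 hexadecimal digits followed by whitespace and the name, and that every other line (block headers starting with @, indented annotations) can be skipped. The function name parse_names and the embedded sample are made up for this example.

```python
import re

# A character entry: hex code point in column 0, whitespace, then the name.
ENTRY = re.compile(r"^([0-9A-F]{4,6})\s+(\S.*)$")

def parse_names(lines):
    """Return {code point: name} for the character entries in the given lines."""
    names = {}
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match:
            names[int(match.group(1), 16)] = match.group(2).strip()
    return names

# Small sample in the same shape as the excerpt (tab-separated).
sample = """\
21FF\tLEFT RIGHT OPEN-HEADED ARROW
@@\t2200\tMathematical Operators\t22FF
2200\tFOR ALL
\t= universal quantifier
2205\tEMPTY SET
\tx (diameter sign - 2300)
"""
names = parse_names(sample.splitlines())
print(names)
```

For the real file you would pass an open file object instead of the sample. Entries whose "name" is a placeholder in angle brackets, such as <control>, would probably need to be filtered out afterwards.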

Solution 3:

Yes, there is a way: go through all existing code points and call unicodedata.name() on each of them, like this:

import unicodedata

names = []
for c in range(0, 0x10FFFF + 1):
    try:
        # name() expects a one-character string, not an integer
        names.append(unicodedata.name(chr(c)))
    except ValueError:
        # raised for code points that have no name
        pass
# Do something with names

Solution 4:

For a given codepoint, you can use unicodedata.name. To get them all, you can work through all 1,114,112 code points (0 through 0x10FFFF) to see which have such names.
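One possible way to put that scan to use is to collect the names into a dictionary once and then search it. This is only a sketch; the helper find_names and the search term are invented for the example.

```python
import sys
import unicodedata

# Map every available name to its character; unnamed code points are skipped.
all_names = {}
for code_point in range(sys.maxunicode + 1):
    name = unicodedata.name(chr(code_point), None)
    if name is not None:
        all_names[name] = chr(code_point)

def find_names(fragment):
    """Return the sorted names that contain the given fragment (case-insensitive)."""
    fragment = fragment.upper()
    return sorted(name for name in all_names if fragment in name)

print(len(all_names))
print(find_names("snowman"))
```

The exact count and the matches depend on the Unicode version that your Python ships with.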

Solution 5:

Just print them all:

import unicodedata

for i in range(0x110000):
    character = chr(i)
    # The empty-string default avoids a ValueError for unnamed code points
    name = unicodedata.name(character, "")
    if name:
        print(f"{i:6} | 0x{i:04X} | {character} | {name}")
