
How To Iterate Over Unicode Characters In Python 3?

I need to step through a Python string one character at a time, but a simple 'for' loop gives me UTF-16 code units instead:

str = 'abc\u20ac\U00010302\U0010fffd'
for ch in str:

Solution 1:

On Python 3.2.1 with a narrow Unicode build:

PythonWin 3.2.1 (default, Jul 10 2011, 21:51:15) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import sys
>>> sys.maxunicode
65535

What you've discovered (UTF-16 encoding):

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
8
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD

A way around it:

>>> import struct
>>> s = s.encode('utf-32-be')
>>> struct.unpack('>{}L'.format(len(s)//4), s)
(97, 98, 99, 8364, 66306, 1114109)
>>> for i in struct.unpack('>{}L'.format(len(s)//4), s):
...     print('U+{:04X}'.format(i))
...
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
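The workaround above can be wrapped in a small generator so it is reusable. This is only a sketch: the name `iter_code_points` is hypothetical and not part of the original answer. It encodes the string to big-endian UTF-32 and unpacks four bytes at a time, so it should yield one integer per code point regardless of how the interpreter stores strings internally.

```python
import struct

def iter_code_points(text):
    """Yield each Unicode code point of text as an int (hypothetical helper)."""
    data = text.encode('utf-32-be')  # 4 bytes per code point, no BOM
    # '>{}L' means: big-endian, N unsigned 32-bit integers
    for value in struct.unpack('>{}L'.format(len(data) // 4), data):
        yield value

s = "abc\u20ac\U00010302\U0010fffd"
for cp in iter_code_points(s):
    print('U+{:04X}'.format(cp))
```

The helper yields integers rather than one-character strings because, on a narrow build, a code point above U+FFFF cannot be held in a single string element.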

Update for Python 3.3:

Now it works the way the OP expects:

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD

Solution 2:

Before Python 3.3, default builds of Python stored Unicode values internally as UCS2 (16-bit code units). The UTF-16 representation of the character U+10302 (written \U00010302) is the surrogate pair \uD800\uDF02, which is why you got that result.
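The arithmetic behind that pair can be worked through in a few lines. This is an illustrative sketch of the standard UTF-16 surrogate calculation; the function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x10302)
print('U+{:04X} U+{:04X}'.format(high, low))  # U+D800 U+DF02
```

Applying the same function to 0x10FFFD gives U+DBFF and U+DFFD, which are the last two values printed by the narrow-build loop in Solution 1.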

That said, some Python builds use UCS4 instead, and UCS2 and UCS4 builds are not binary compatible with each other.

Take a look here.

Py_UNICODE This type represents the storage type which is used by Python internally as basis for holding Unicode ordinals. Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).
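Which variant a given interpreter uses can be checked at run time through `sys.maxunicode`, as Solution 1 does interactively. A minimal sketch follows; the helper name `is_narrow_build` is hypothetical.

```python
import sys

def is_narrow_build():
    """Return True when sys.maxunicode is 0xFFFF, i.e. a narrow (UCS2) build."""
    return sys.maxunicode == 0xFFFF

kind = 'narrow (UCS2)' if is_narrow_build() else 'wide'
print(kind, hex(sys.maxunicode))
```

On Python 3.3 and later, `sys.maxunicode` is always 0x10FFFF, so the check only matters when the code also has to run on older interpreters.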

Solution 3:

If you create the string as a unicode object, iterating over it should yield one character at a time automatically. E.g.:

Python 2.6:

s = u"abc\u20ac\U00010302\U0010fffd"  # note u in front!
for c in s:
    print "U+%04x" % ord(c)

I received:

U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd

Python 3.2:

s = "abc\u20ac\U00010302\U0010fffd"
for c in s:
    print("U+%04x" % ord(c))

It worked for me:

U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd

Additionally, I found this link, which explains that the behavior is working correctly. If the string came from a file or a similar source, it will likely need to be decoded first.
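A short sketch of the decoding step, assuming the data is UTF-8 (the actual encoding of a given file has to be known or detected separately) and assuming Python 3.3 or later, or a wide build, so that each element of the decoded string is a full character:

```python
# Bytes as they might arrive from a file or socket (UTF-8 is an assumption here)
raw = b'abc\xe2\x82\xac\xf0\x90\x8c\x82'

# Iterating over bytes yields individual byte values, not characters
byte_values = list(raw)

# Decode first, then iterate over the resulting str
text = raw.decode('utf-8')
code_points = [ord(ch) for ch in text]

print(len(byte_values), 'bytes')
print(['U+{:04X}'.format(cp) for cp in code_points])
```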

Update:

I've found an insightful explanation here. The size of the internal Unicode representation is a compile-time option. If you work with "wide" characters outside the 16-bit plane (above U+FFFF), you'll need to build Python yourself to remove the limitation, or use one of the workarounds on this page. Apparently many Linux distros already do this for you, as I encountered above.
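Another possible workaround, sketched here and not taken from any of the answers, is to join surrogate pairs by hand while iterating. The helper name `iter_joined_code_points` is hypothetical. On a wide build or Python 3.3+, ordinary strings contain no surrogate pairs, so the loop simply passes every character through.

```python
def iter_joined_code_points(text):
    """Yield code points as ints, combining UTF-16 surrogate pairs (hypothetical helper)."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            nxt = ord(text[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)
                i += 2
                continue
        # Anything else (including a lone surrogate) is passed through unchanged
        yield unit
        i += 1

# Surrogates written out explicitly, mimicking what a narrow build stores
narrow_style = "abc\u20ac\ud800\udf02\udbff\udffd"
for cp in iter_joined_code_points(narrow_style):
    print('U+{:04X}'.format(cp))
```

Yielding integers avoids having to build a one-character string for a code point that a narrow build cannot represent in a single element.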
