How Well Does Your Language Support Unicode In Practice?

April 29, 2023 Post a Comment

I'm looking into new languages, kind of craving for one where I no longer need to worry about charset problems amongst inordinate amounts of other niggles I have with PHP for a new

Solution 1:

Python's unicode support did not really change in 3.x. The unicode support in Python has been pretty much the same since Python 2.x, which introduced the separate unicode type and the encoding handling. What Python 3.x changes is that unicode becomes the only string type (and is renamed to str), whereas 2.x has bytestrings (str, "...") and unicode strings (unicode, u"...") that often but not always don't quite mix. (Allowing them to mix was an attempt to make transitioning from bytestrings to unicode easier, but it turned out a mistake.) All in all, Python's unicode support is quite good, mistakes in Python 2.x notwithstanding. There's unicode literals with numeric and named escapes, source-encoding declarations for non-ASCII characters in unicode literals, automatic encoding/decoding through the codecs module, unicode support in many libraries (like the regular expression and DB-API modules) and a builtin unicode database.

That said, you still need to know about encodings in order to handle text correctly. Your program will receive bytes in some encoding (be it from files, from environment variables or through other input) and they will need to be interpreted in that encoding. If you don't know the encoding (and can't determine it from the data, like in HTML or XML) you can really only process the data as bytes. If you do know the encoding, Python does allow you to deal with it mostly transparently.

Solution 2:

Perl has excellent support of unicode. You need to know how to use is properly, but i never find any language what has better unicode support than perl, especially now with perl5.14.

Solution 3:

Racket (in the Lisp/Scheme camp) has good Unicode support. Racket distinguishes character strings (written "abc") from byte strings (written #"abc"). Character strings consist of Unicode characters and have all the Unicode-aware string operations one would expect (comparison, case folding, etc). By default Racket uses UTF-8 for character string I/O (including the encoding of source files), but it also supports conversion to and from other encodings. The GUI toolkit works with Unicode. So do regular expressions.

Solution 4:

From my personal experience, Ruby 1.9.2 handles unicode internally pretty good except some strange areas like upcase/downcase/capitalize methods for String class. I have to override them for all my Rails applications.

Solution 5:

Lisps have strong support for unicode. All modern popular lisps (SBCL, Clozure CL, clisp) use UTF-32/UCS-4 for strings and support UTF-8 as an external format.

Python Guru