The transition to multilingual programming with Python
A recent thread on python-dev prompted me to summarise the current state of the ongoing industry-wide transition from bilingual to multilingual programming as it relates to Python’s cross-platform support. It also relates to the reasons why Python 3 turned out to be more disruptive than the core development team initially expected.
A good starting point for anyone interested in exploring this topic further is the “Origin and development” section of the Wikipedia article on Unicode, but I’ll hit the key points below.
At their core, computers only understand single bits. Everything above that is based on conventions that ascribe higher level meanings to particular sequences of bits. One particularly important set of conventions for communicating between humans and computers is “text encodings”: conventions that map particular sequences of bits to text in the actual languages humans read and write.
One of the oldest encodings still in common use is ASCII (which stands for “American Standard Code for Information Interchange”), developed during the 1960s (it just had its 50th birthday in 2013). This encoding maps the letters of the English alphabet (in both upper and lower case), the decimal digits, various punctuation characters and some additional “control codes” to the 128 numbers that can be encoded as a 7-bit sequence.
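As a small illustration of that 7-bit limit (a sketch for Python 3; the sample strings are arbitrary choices, not taken from any real system), every character of an English phrase maps to a number below 128, while a single accented letter is already out of range:

```python
# Every ASCII character maps to a number below 128, so it fits in 7 bits.
english = "Hello, World!"
codes = [ord(ch) for ch in english]
fits_in_7_bits = all(code < 2 ** 7 for code in codes)
ascii_bytes = english.encode("ascii")

# A single accented letter (U+00E9) is already outside the ASCII range.
try:
    "caf\u00e9".encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(fits_in_7_bits, ascii_bytes, type(ascii_error).__name__)
```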
Many computer systems today still only work correctly with English – when you encounter such a system, it’s a fairly good bet that either the system itself, or something it depends on, is limited to working with ASCII text. (If you’re really unlucky, you might even get to work with modal 5-bit encodings like ITA-2, as I have. The legacy of the telegraph lives on!)
Working with local languages
The first attempts at dealing with this limitation of ASCII simply assigned meanings to the full range of 8-bit sequences. Known collectively as “Extended ASCII”, each of these systems allowed for an additional 128 characters, which was enough to handle many European and Cyrillic scripts. Even 256 characters was nowhere near sufficient to deal with Indic or East Asian languages, however, so this time also saw a proliferation of ASCII incompatible encodings like ShiftJIS, ISO-2022 and Big5. This is why Python ships with support for dozens of codecs from around the world.
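To see why this proliferation matters, the following sketch (Python 3, using only codecs that ship with the standard library; the sample byte and sample text are my own picks) decodes one byte under three “Extended ASCII” encodings and encodes one Japanese string under three ASCII incompatible encodings:

```python
# The same byte value means something different in each 8-bit encoding.
raw = b"\xe9"
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp1251", "koi8_r")}
for codec, char in decoded.items():
    print(codec, repr(char))

# The multi-byte East Asian encodings disagree about the bytes for the same text.
text = "日本語"
encoded = {codec: text.encode(codec) for codec in ("shift_jis", "euc_jp", "iso2022_jp")}
for codec, data in encoded.items():
    print(codec, data)
```

Without knowing which encoding produced the bytes, there is no way to tell from the bytes alone which of these readings was intended.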
This proliferation of encodings required a way to tell software which encoding should be used to read the data. For protocols that were originally designed for communication between computers, agreeing on a common text encoding is usually handled as part of the protocol. In cases where no encoding information is supplied (or where there is a mismatch between the claimed encoding and the actual encoding), applications may make use of “encoding detection” algorithms, like those provided by the chardet package for Python. These algorithms aren’t perfect, but can give good answers when given a sufficient amount of data to work with.
Local operating system interfaces, however, are a different story. Not only don’t they inherently convey encoding information, but the nature of the problem is such that trying to use encoding detection isn’t practical. Two key systems arose in an attempt to deal with this problem:
- Windows code pages
- POSIX locale encodings
With both of these systems, a program would pick a code page or locale, and use the corresponding text encoding to decide how to interpret text for display to the user or combination with other text. This may include deciding how to display information about the contents of the computer itself (like listing the files in a directory).
The fundamental premise of these two systems is that the computer only needs to speak the language of its immediate users. So, while the computer is theoretically capable of communicating in any language, it can effectively only communicate with humans in one language at a time. All of the data a given application was working with would need to be in a consistent encoding, or the result would be uninterpretable nonsense, something the Japanese (and eventually everyone else) came to call mojibake.
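A minimal sketch of how mojibake arises, assuming one program writes Shift JIS and another reads the same bytes as Latin-1 (both encodings are just convenient examples):

```python
original = "文字化け"  # "mojibake" written in Japanese

# Bytes written by a program that assumes Shift JIS...
raw = original.encode("shift_jis")

# ...and read by a program that assumes Latin-1. No error is raised,
# because every byte value is valid Latin-1; the text is simply wrong.
garbled = raw.decode("latin-1")
print(repr(garbled))

# The data itself is intact: undoing the wrong guess recovers the text.
recovered = garbled.encode("latin-1").decode("shift_jis")
print(recovered)
```

The absence of any exception is the important part: nothing tells either program that the displayed text is nonsense.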
It isn’t a coincidence that the name for this concept came from an Asian country: the encoding problems encountered there make the issues encountered with European and Cyrillic languages look trivial by comparison.
Unfortunately, this “bilingual computing” approach (so called because the computer could generally handle English in addition to the local language) causes some serious problems once you consider communicating between computers. While some of those problems were specific to network protocols, there are some more serious ones that arise when dealing with nominally “local” interfaces:
- networked computing meant one username might be used across multiple systems, including different operating systems
- network drives allow a single file server to be accessed from multiple clients, including different operating systems
- portable media (like DVDs and USB keys) allow the same filesystem to be accessed from multiple devices at different points in time
- data synchronisation services like Dropbox need to faithfully replicate a filesystem hierarchy not only across different desktop environments, but also to mobile devices
For these protocols, which were originally designed only for local interoperability, communicating encoding information is generally difficult, and the encoding of the data doesn’t necessarily match the claimed encoding of the platform you’re running on.
Unicode and the rise of multilingual computing
The path to addressing the fundamental limitations of bilingual computing actually started more than 25 years ago, back in the late 1980s. An initial draft proposal for a 16-bit “universal encoding” was released in 1988, the Unicode Consortium was formed in early 1991 and the first volume of the first version of Unicode was published later that same year.
Microsoft added new text handling and operating system APIs to Windows based on the 16-bit C level wchar_t type, and Sun also adopted Unicode as part of the core design of Java’s approach to handling text.
However, there was a problem. The original Unicode design had decided that “16 bits ought to be enough for anybody” by restricting their target to only modern scripts, and only frequently used characters within those scripts. However, when you look at the “rarely used” Kanji and Han characters for Japanese and Chinese, you find that they include many characters that are regularly used for the names of people and places – they’re just largely restricted to proper nouns, and so won’t show up in a normal vocabulary search. So Unicode 2.0 was defined in 1996, expanding the system out to a maximum of 21 bits per code point (using up to 32 bits per code point for storage).
As a result, Windows (including the CLR) and Java now use the little-endian variant of UTF-16 to allow their text APIs to handle arbitrary Unicode code points. The original 16-bit code space is now referred to as the Basic Multilingual Plane.
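The effect of the expanded code space on UTF-16 can be sketched in Python 3 as follows. The two sample characters are my own choices: one CJK ideograph inside the Basic Multilingual Plane and one outside it.

```python
bmp_char = "\u8a9e"         # 語, inside the Basic Multilingual Plane
astral_char = "\U00020bb7"  # 𠮷, a CJK ideograph outside the BMP

for ch in (bmp_char, astral_char):
    units = len(ch.encode("utf-16-le")) // 2
    print("U+%04X" % ord(ch), "needs", units, "UTF-16 code unit(s)")

# The second character is stored as a surrogate pair of two 16-bit units.
pair = astral_char.encode("utf-16-le")
high = int.from_bytes(pair[:2], "little")
low = int.from_bytes(pair[2:], "little")
print(hex(high), hex(low))
```

Software that treats each 16-bit unit as a whole character works for the first sample and quietly breaks on the second.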
While all that was going on, the POSIX world ended up adopting a different strategy for migrating to full Unicode support: attempting to standardise on the ASCII compatible UTF-8 text encoding.
The choice between using UTF-8 and UTF-16-LE as the preferred local text encoding involves some complicated trade-offs, and that’s reflected in the fact that they have ended up being at the heart of two competing approaches to multilingual computing.
Choosing UTF-8 aims to treat formatting text for communication with the user as “just a display issue”. It’s a low impact design that will “just work” for a lot of software, but it comes at a price:
- because encoding consistency checks are mostly avoided, data in different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application.
- for interfaces without encoding information available, it is often necessary to assume an appropriate encoding in order to display information to the user, or to transform it to a different encoding for communication with another system that may not share the local system’s encoding assumptions. These assumptions may not be correct, but won’t necessarily cause an error – the data may just be silently misinterpreted as something other than what was originally intended.
- because data is generally decoded far from where it was introduced, it can be difficult to discover the origin of encoding errors.
- as a variable width encoding, it is more difficult to develop efficient string manipulation algorithms for UTF-8. Algorithms originally designed for fixed width encodings will no longer work.
- as a specific instance of the previous point, it isn’t possible to split UTF-8 encoded text at arbitrary locations. Care needs to be taken to ensure splits only occur at code point boundaries.
UTF-16-LE shares the last two problems, but to a lesser degree (simply due to the fact most commonly used code points are in the 16-bit Basic Multilingual Plane). However, because it isn’t generally suitable for use in network protocols and file formats (without significant additional encoding markers), the explicit decoding and encoding required encourages designs with a clear separation between binary data (including encoded text) and decoded text data.
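The splitting problem is easy to reproduce. This sketch (Python 3; the sample text and the split position are arbitrary) cuts UTF-8 data in the middle of a two-byte sequence, then shows one way to cope using the standard library’s incremental decoder:

```python
import codecs

# "naïve café", written with escapes so the accented letters are single code points.
data = "na\u00efve caf\u00e9".encode("utf-8")

# Byte offset 3 falls in the middle of the two-byte sequence for "ï".
head, tail = data[:3], data[3:]
try:
    head.decode("utf-8")
    split_error = None
except UnicodeDecodeError as exc:
    split_error = exc

# An incremental decoder buffers the incomplete sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(head), decoder.decode(tail, final=True)]
print(pieces)
```

The same approach applies to data arriving in chunks from a socket or file, where chunk boundaries are not under the reader’s control.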
Through the lens of Python
Python and Unicode were born on opposite sides of the Atlantic Ocean at roughly the same time (1991). The growing adoption of Unicode within the computing industry has had a profound impact on the evolution of the language.
Python 1.x was purely a product of the bilingual computing era – it had no support for Unicode based text handling at all, and was hence largely limited to 8-bit ASCII compatible encodings for text processing.
Python 2.x was still primarily a product of the bilingual era, but added multilingual support as an optional addon, in the form of the unicode type and support for a wide variety of text encodings. PEP 100 goes into the many technical details that needed to be covered in order to incorporate that feature. With Python 2, you can make multilingual programming work, but it requires an active decision on the part of the application developer, or at least that they follow the guidelines of a framework that handles the problem on their behalf.
By contrast, Python 3.x is designed to be a native denizen of the multilingual computing world. Support for multiple languages extends as far as the variable naming system, such that languages other than English become almost as well supported as English already was in Python 2. While the English inspired keywords and the English naming in the standard library and on the Python Package Index mean that Python’s “native” language and the preferred language for global collaboration will always be English, the new design allows a lot more flexibility when working with data in other languages.
Consider processing a data table where the headings are names of Japanese individuals, and we’d like to use collections.namedtuple to process each row. Python 2 simply can’t handle this task:
    >>> from collections import namedtuple
    >>> People = namedtuple("People", u"陽斗 慶子 七海")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.7/collections.py", line 310, in namedtuple
        field_names = map(str, field_names)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Users need to either restrict themselves to dictionary style lookups rather than attribute access, or else use romanised versions of their names (Haruto, Keiko, Nanami for the example). However, the case of “Haruto” is an interesting one, as there are at least 3 different ways of writing that as Kanji (陽斗, 陽翔, 大翔), but they are all romanised as the same string (Haruto). If you try to use romaji to handle a data set that contains more than one variant of that name, you’re going to get spurious collisions.
Python 3 takes a very different perspective on this problem. It says it should just work, and it makes sure it does:
    >>> from collections import namedtuple
    >>> People = namedtuple("People", u"陽斗 慶子 七海")
    >>> d = People(1, 2, 3)
    >>> d.陽斗
    1
    >>> d.慶子
    2
    >>> d.七海
    3
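The romanisation collision described earlier can be made concrete with a short Python 3 sketch. The `romaji` lookup table is hypothetical, built from the three spellings of “Haruto” mentioned above:

```python
from collections import namedtuple

# Hypothetical lookup table from three kanji spellings to their romanised form.
romaji = {"陽斗": "Haruto", "陽翔": "Haruto", "大翔": "Haruto"}

# The kanji spellings are distinct, valid identifiers in Python 3.
Kanji = namedtuple("Kanji", list(romaji))
row = Kanji(1, 2, 3)
print(row)

# The romanised headings collide, and namedtuple rejects duplicate field names.
try:
    namedtuple("Romaji", list(romaji.values()))
    collision = None
except ValueError as exc:
    collision = exc
print(collision)
```

Keeping the original spellings as field names avoids having to invent a disambiguation scheme for the romanised versions.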
This change greatly expands the kinds of “data driven” use cases Python can support in areas where the ASCII based assumptions of Python 2 would cause serious problems.
Python 3 still needs to deal with improperly encoded data however, so it provides a mechanism for arbitrary binary data to be “smuggled” through text strings in the Unicode low surrogate area. This feature was added by PEP 383 and is managed through the surrogateescape error handler, which is used by default on most operating system interfaces. This recreates the old Python 2 behaviour of passing improperly encoded data through unchanged when dealing solely with local operating system interfaces, but complaining when such improperly encoded data is injected into another interface. The codec error handling system provides several tools to deal with these files, and we’re looking at adding a few more relevant convenience functions for Python 3.5.
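Here is a minimal sketch of that smuggling behaviour using the codec machinery directly. The filename is made up, and decoding with an explicit UTF-8 codec stands in for what the operating system interfaces do with the platform’s own encoding:

```python
raw = b"caf\xe9.txt"  # Latin-1 encoded bytes, which are not valid UTF-8

# The undecodable byte 0xE9 is mapped to the lone low surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses to emit the smuggled byte, so the problem
# surfaces when the data is injected into another interface.
try:
    name.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc
print(type(strict_error).__name__)
```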
The underlying Unicode changes in Python 3 also made PEP 393 possible, which changed the way the CPython interpreter stores text internally. In Python 2, even pure ASCII strings would consume four bytes per code point on Linux systems. Using the “narrow build” option (as the Python 2 Windows builds from python.org do) reduced that to only two bytes per code point when operating within the Basic Multilingual Plane, but at the cost of potentially producing wrong answers when asked to operate on code points outside the Basic Multilingual Plane. By contrast, starting with Python 3.3, CPython now stores text internally using the smallest fixed width data unit possible. That is, latin-1 text uses 8 bits per code point, UCS-2 (Basic Multilingual Plane) text uses 16 bits per code point, and only text containing code points outside the Basic Multilingual Plane will expand to needing the full 32 bits per code point. This can not only significantly reduce the amount of memory needed for multilingual applications, but may also increase their speed as well (as reducing memory usage also reduces the time spent copying data around).
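The storage widths can be observed from Python itself. This is a rough sketch that assumes CPython 3.3 or later (other implementations may store text differently); `bytes_per_code_point` is a helper name invented for the illustration:

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate per-character storage by comparing two string sizes.

    Subtracting the sizes cancels out the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",
    "BMP": "\u8a9e",
    "non-BMP": "\U00020bb7",
}
widths = {label: bytes_per_code_point(ch) for label, ch in samples.items()}
print(widths)
```

On a CPython build with PEP 393, I would expect this to report 1, 1, 2 and 4 bytes respectively, matching the 8, 16 and 32 bit units described above.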
Are we there yet?
In a word, no. Not for Python 3.4, and not for the computing industry at large. We’re much closer than we ever have been before, though. Most POSIX systems now use UTF-8 as their default encoding, and many systems offer a C.UTF-8 locale as an alternative to the traditional ASCII based C locale. When dealing solely with properly encoded data and metadata, and properly configured systems, Python 3 should “just work”, even when exchanging data between different platforms.
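To check what a particular system is actually telling Python, something like the following can be run. It is only a sketch, and the output depends entirely on the platform and on environment settings such as the active locale:

```python
import locale
import sys

settings = {
    # Encoding suggested by the active locale (False: do not change the locale).
    "locale preferred encoding": locale.getpreferredencoding(False),
    # Encoding used for filenames and other operating system interfaces.
    "filesystem encoding": sys.getfilesystemencoding(),
    # Encoding of standard output (may be missing if stdout was replaced).
    "stdout encoding": getattr(sys.stdout, "encoding", None),
}
for label, value in settings.items():
    print(label + ":", value)
```

If any of these report an ASCII based encoding on a system that is expected to handle multilingual data, the locale configuration is a likely place to start looking.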
For Python 3, the remaining challenges fall into a few areas:
- helping existing Python 2 users adopt the optional multilingual features that will prepare them for eventual migration to Python 3 (as well as reassuring those users that don’t wish to migrate that Python 2 is still fully supported, and will remain so for at least the next several years, and potentially longer for customers of commercial redistributors)
- adding back some features for working entirely in the binary domain that were removed in the original Python 3 transition due to an initial assessment that they were operations that only made sense on text data (PEP 461 summary: bytes.__mod__ is coming back in Python 3.5 as a valid binary domain operation, bytes.format stays gone as an operation that only makes sense when working with actual text data)
- better handling of improperly decoded data, including poor encoding recommendations from the operating system (for example, Python 3.5 will be more sceptical when the operating system tells it the preferred encoding is ASCII and will enable the surrogateescape error handler on sys.stdout when it occurs)
- eliminating most remaining usage of the legacy code page and locale encoding systems in the CPython interpreter (this most notably affects the Windows console interface and argument decoding on POSIX. While these aren’t easy problems to solve, it will still hopefully be possible to address them for Python 3.5)
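As an illustration of the PEP 461 item in the list above, this sketch shows binary interpolation on bytes objects. It assumes Python 3.5 or later, and the header values are invented for the example:

```python
# Requires Python 3.5 or later (PEP 461).
content_length = 42
header = b"Content-Length: %d\r\n" % content_length
line = b"%s: %s" % (b"Host", b"example.org")
print(header, line)

# bytes.format was not restored: str.format style formatting remains a
# text operation, so the data has to be decoded first.
has_bytes_format = hasattr(bytes, "format")
print(has_bytes_format)
```

This keeps wire protocol code in the binary domain without a round trip through text.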
More broadly, each major platform has its own significant challenges to address:
- for POSIX systems, there are still a lot of systems that don’t use UTF-8 as the preferred encoding and the assumption of ASCII as the preferred encoding in the default C locale is positively archaic. There is also still a lot of POSIX software that still believes in the “text is just encoded bytes” assumption, and will happily produce mojibake that makes no sense to other applications or systems.
- for Windows, keeping the old 8-bit APIs around was deemed necessary for backwards compatibility, but this also means that there is still a lot of Windows software that simply doesn’t handle multilingual computing correctly.
- for both Windows and the JVM, a fair amount of nominally multilingual software actually only works correctly with data in the basic multilingual plane. This is a smaller problem than not supporting multilingual computing at all, but was quite a noticeable problem in Python 2’s own Windows support.
Mac OS X is the platform most tightly controlled by any one entity (Apple), and they’re actually in the best position out of all of the current major platforms when it comes to handling multilingual computing correctly. They’ve been one of the major drivers of Unicode since the beginning (two of the authors of the initial Unicode proposal were Apple engineers), and were able to force the necessary configuration changes on all their systems, rather than having to work with an extensive network of OEM partners (Windows, commercial Linux vendors) or relatively loose collaborations of individuals and organisations (community Linux distributions).
Modern mobile platforms are generally in a better position than desktop operating systems, mostly by virtue of being newer, and hence defined after Unicode was better understood. However, the UTF-8 vs UTF-16-LE distinction for text handling exists even there, thanks to the Java inspired Dalvik VM in Android (plus the cloud-backed nature of modern smartphones means you’re even more likely to encounter files from multiple machines when working on a mobile device).
Also posted here: The transition to multilingual programming | Curious Efficiency.
Update (15th September, 2014): This article originally stated that the “surrogateescape” codec error handler smuggled bytes in the Unicode Private Use Area. While such an approach was originally discussed, the final design chosen actually uses 128 code points from the “low surrogate area”.
Update (15th September, 2014): The link for the restoration of bytes.__mod__ support has been updated to refer to the correct Python Enhancement Proposal.