Fun with internationalization, ISO 3166-1, ISO 639 and the CLDR

3:13pm 8th January 2008

So I came to a point in a software project that I am working on where it is time to consider how foreign users will be able to use the system. Yep, it's time for i18n and L10n. Initially, I thought that this would be as easy as simply setting up a translation matrix, and wrapping all static text in a function that looks it up in the matrix and outputs the appropriate language.

This worked fine. I set up the matrix using a list of languages maintained in the database according to the ISO 639-1 standard, wrote a simple yet effective tool for maintaining translation tables, and began using online translation services to translate the simplest words and phrases, performing reverse translations on them to ensure they were not contextually mutated by the translator. Most words could be done this way, and in a very short period of time I had substantial portions of the site in other languages, including languages written in non-Latin scripts such as Russian, Greek and Thai. All was well in the world.
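That first approach can be sketched roughly like this. The dictionary, key names and fallback behaviour here are illustrative only; in the real system the matrix lives in the database:

```python
# Hypothetical in-memory translation matrix keyed by ISO 639-1 code.
# The real system keeps this in a database table.
TRANSLATIONS = {
    "welcome": {"en": "Welcome", "ru": "Добро пожаловать", "el": "Καλώς ήρθατε"},
    "logout":  {"en": "Log out", "ru": "Выйти"},
}

def t(key: str, lang: str, default_lang: str = "en") -> str:
    """Look up a static string, falling back to the default language,
    and finally to the key itself if nothing is found."""
    entry = TRANSLATIONS.get(key, {})
    return entry.get(lang, entry.get(default_lang, key))

print(t("welcome", "ru"))  # Добро пожаловать
print(t("logout", "el"))   # no Greek entry, falls back: Log out
```

The flaw described below is already visible here: the single `lang` axis has no room for two varieties of the same language.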

Or so I thought. It all hit me at once: a single thought that brought all my cleverness crashing down and threw up a whole new set of problems to overcome. It was like the opposite of an epiphany. Here's the situation: if I have a list of languages from ISO 639, how do I handle the fact that the same language may actually need two translations? For example, how would I deal with translating the word "colour" for the American spelling? The ISO code 'en' does not allow for more than one variety of English. Then there's Portuguese, which has two regional variants: that spoken in Portugal and that spoken in Brazil. Finally, Chinese: ISO 639 specifies only a single language, 'zh', yet there are two written forms, traditional and simplified, which are practically different languages.

It was clear that my solution was just not sufficient. Back to the drawing board.

After pondering this problem for some time, I decided that if I was going to do i18n at all, I was going to do it properly. After all, I've been uncompromising in my support for and awareness of time zone handling. Additionally, I had been careful from the word go about using UTF-8 aware functionality in the code, the database and all ancillary systems. It'd be silly to compromise on proper i18n at this point. That meant using appropriate date formats, spelling variants, weekday/weekend definitions, everything. But how? I had nowhere near the resources to engage in the collation and maintenance of that information. Nonetheless, I was determined that it be done right.

I was given salvation by another IRCer who is also working with this at the moment. He told me about the Common Locale Data Repository, or CLDR for short. It is a large, freely available repository of locale data stored and distributed in a widely used XML format called LDML (the Locale Data Markup Language). LDML is used in many projects for the exchange of locale data, including Microsoft's .NET framework. A relatively new project, having only started in 2003, the CLDR is already by far the most comprehensive repository of locale data that is freely available. It is maintained by the Unicode Consortium, so one can be certain that things like format stability, backwards compatibility and data consistency will be given due attention.

Implementing full i18n/L10n in my project will be fairly involved, but not difficult. Locale identifiers are made up of ISO 639 language codes and ISO 3166-1 alpha-2 country codes, for both of which I already have authoritative sources, so combining them into valid locales is a trivial job. Instead of my translation table being keyed on one axis by language, it is now keyed by locale with a default "fallback" language. That means very little change is required to the infrastructure already written, and it can support language variants ranging from trivial one- or two-word changes through to wholly different languages with different character sets. Furthermore, it seems the CLDR data does not even need to be imported into the database before being used: the XML file can be stored locally and queried directly, so future versions of the file become a drop-in replacement, allowing virtually effortless expansion of locale awareness. The CLDR allows me to trivially do the following:

  • Translate basic data such as month names, day of the week names and names of countries into many languages.
  • Format dates and times according to local conventions.
  • Perform character repertoire tests to guarantee that the fonts used include all the characters necessary to render a language fully.
  • Translate time zone names into local languages, since the CLDR uses the same zone identifiers as the zoneinfo database found on most POSIX systems.
  • Determine local currency and its symbol.
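The locale-with-fallback scheme can be sketched as follows, assuming locale identifiers of the form language or language_REGION. The table contents and function names are made up for illustration; only the lookup chain (e.g. pt_BR → pt → default) reflects the design described above:

```python
def fallback_chain(locale, default="en"):
    """Build lookup candidates from most to least specific:
    e.g. 'pt_BR' -> ['pt_BR', 'pt', 'en']."""
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_", 1)[0])  # bare language code
    if default not in chain:
        chain.append(default)
    return chain

def translate(key, locale, table, default="en"):
    """Return the most specific available translation for key."""
    entry = table.get(key, {})
    for candidate in fallback_chain(locale, default):
        if candidate in entry:
            return entry[candidate]
    return key

# The "colour" problem from earlier, solved by locale-keyed entries:
strings = {"colour_label": {"en_US": "Color", "en": "Colour", "pt": "Cor"}}
print(translate("colour_label", "en_US", strings))  # Color
print(translate("colour_label", "en_GB", strings))  # Colour (falls back to 'en')
print(translate("colour_label", "pt_BR", strings))  # Cor (falls back to 'pt')
```

Because only the key axis changed, a translator need only add a regional entry where a variant actually differs; everything else inherits the base language.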

In order to serve a user's needs, I will need to gather as many of the following details from them as possible, in this order:

  1. Locale: As a basic minimum, this will allow the representation of content in their local language and formatting of dates in their expected format.
  2. Location: A user's chosen locale may not correspond to where they are physically located, so they may want to specify their country separately in order to be shown appropriate location-based settings and defaults.
  3. Time zone: This cannot be reliably inferred from the above two pieces of information, although it can be guessed reasonably accurately. The user will, however, need the option of changing it, should they wish.
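To illustrate why the time zone can only be guessed, here is a sketch using a tiny hand-made country-to-zone table (a real implementation would derive this from the zoneinfo database's zone.tab file, which maps ISO 3166-1 codes to zone names). Single-zone countries resolve cleanly; countries spanning several zones need the user's input:

```python
# Hypothetical excerpt of an ISO 3166-1 country -> zoneinfo names table.
COUNTRY_ZONES = {
    "NL": ["Europe/Amsterdam"],
    "TH": ["Asia/Bangkok"],
    "US": ["America/New_York", "America/Chicago",
           "America/Denver", "America/Los_Angeles"],  # and several more
}

def guess_timezone(country_code):
    """Return a zone only when the country has exactly one;
    otherwise return None, meaning the user must be asked."""
    zones = COUNTRY_ZONES.get(country_code, [])
    return zones[0] if len(zones) == 1 else None

print(guess_timezone("TH"))  # Asia/Bangkok
print(guess_timezone("US"))  # None -> present a choice to the user
```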

In order to allow full localization, those three pieces of information need to be determined about the user. Much of it can be inferred using IP-based geolocation providers such as IP2Location and MaxMind. The ultimate goal is for a user to simply hit the web site and immediately see their language, dates in their local format and times converted to their local time zone. Only time will tell how close I can get to that ideal.

On a side note, I would like to give special mention to PostgreSQL's support for time zone and date arithmetic. Being able to perform all date/time-related calculations in the database makes it trivial to implement zone-aware time handling. PostgreSQL is, in my opinion, the RDBMS of choice for applications requiring non-trivial date and time handling or other i18n functionality.
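As an illustration of what that looks like in practice (the table and column names here are hypothetical), PostgreSQL's AT TIME ZONE construct converts a stored "timestamp with time zone" value, which is held internally in UTC, into any zoneinfo zone at query time:

```sql
-- Assuming a hypothetical posts table whose created_at column is
-- of type "timestamp with time zone":
SELECT created_at AT TIME ZONE 'Australia/Sydney' AS local_time
FROM posts;

-- Interval arithmetic is just as direct:
SELECT now() - created_at AS age FROM posts;
```

Storing everything in UTC and converting per user at query time keeps the application code almost entirely free of time zone logic.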

So that's that. I now have all the tools necessary to ensure that this project ends up with a final product that is fully location aware, allowing users to select their locale and have all relevant output tailored to their regional expectations. I intend this project to end up a glowing example of i18n done right. If you have anything to add, or know of anything I may have overlooked, I invite you to drop me a comment below.