Fun with internationalization, ISO 3166-1, ISO 639 and the CLDR

3:13pm 8th January 2008

So I came to a point in a software project I am working on where it was time to consider how foreign users would be able to use the system. Yep, it's time for i18n and L10n. Initially, I thought this would be as easy as setting up a translation matrix and wrapping all static text in a function that looks it up and outputs the appropriate language.
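As a rough illustration of that first idea, here is a minimal sketch of a lookup wrapper. The table contents and the names TRANSLATIONS and translate are hypothetical, made up for this example; the real table lives in the database.

```python
# Hypothetical translation matrix: (source phrase, ISO 639-1 code) -> translation.
# In the real project this would be a database table, not a dict.
TRANSLATIONS = {
    ("Welcome", "fr"): "Bienvenue",
    ("Welcome", "de"): "Willkommen",
    ("Search", "fr"): "Rechercher",
}


def translate(phrase: str, language: str) -> str:
    """Return the stored translation, or the phrase itself if none exists."""
    return TRANSLATIONS.get((phrase, language), phrase)


if __name__ == "__main__":
    print(translate("Welcome", "fr"))
    # No German entry for "Search", so the source phrase is returned unchanged.
    print(translate("Search", "de"))
```

Falling back to the untranslated phrase means a missing entry degrades to the source language rather than to an empty string.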

This worked fine. I set up the matrix using a list of languages maintained in the database according to the ISO 639-1 standard. I wrote a simple yet effective tool for maintaining translation tables and began using online translation services to translate the simplest words and phrases, performing reverse translations on them to ensure the translator had not changed their meaning. Most words could be done this way, and in a very short period of time I had substantial portions of the site in other languages, including languages that do not use the Latin script, such as Russian, Greek and Thai. All was well in the world.

Or so I thought. It all hit me at once. A single thought that brought all my cleverness crashing down, and threw up a whole new set of problems to overcome. It was like the opposite of an epiphany. Here's the situation: if my languages come from the ISO 639 list, how do I handle the fact that the same language may actually need two translations? E.g., how would I deal with translating the word "colour" for the American spelling? The ISO code 'en' does not allow for more than one English language. And then there's Portuguese; it has two major regional variants, that spoken in Portugal and that spoken in Brazil. Finally, Chinese. ISO 639-1 only specifies a single language, 'zh', however there are two written forms, traditional and simplified, which need to be treated as separate translations.

It was clear that my solution was just not sufficient. Back to the drawing board.

After pondering this problem for some time, I decided that if I was going to do i18n at all, I was going to do it properly. After all, I've been uncompromising in my support for and awareness of time zone handling. Additionally, I had been careful from the word go about using UTF-8 aware functionality in the code, the database and all ancillary systems. It'd be silly to compromise on proper i18n at this point. That meant using appropriate date formats, spelling variants, weekday/weekend definitions, everything. But how? I had nowhere near the resources to engage in the collation and maintenance of that information. Nonetheless, I was determined that it be done right.

I was given salvation by another IRCer who is also working with this at the moment. He told me about the Common Locale Data Repository, or CLDR for short. It is a large, freely available repository of locale data stored and distributed in a widely used XML format called LDML (Locale Data Markup Language). This format is used in many projects for the exchange of locale data, including Microsoft's .NET framework. A relatively new project, having only started in 2003, the CLDR is already by far the most comprehensive repository of locale data that is freely available. It is maintained by the Unicode Consortium, so one can be confident that things like format stability, backwards compatibility and data consistency are going to be given due attention.

Implementing full i18n/L10n in my project will be fairly involved, but not difficult. Locale identifiers are made up of ISO 639 language codes and ISO 3166-1 alpha-2 country codes, and I already have authoritative sources for both. Combining them into valid locales is a trivial job. Instead of being keyed on one axis by language, my translation table is now keyed by locale, with each locale having a default "fallback" language. Very little change is required to the infrastructure already written, and it now supports language variants ranging from trivial one or two word changes through to wholly different languages with different character sets. Furthermore, it does not seem that CLDR data even needs to be loaded into the DB before being used: the XML files can be stored locally and queried directly, so future CLDR releases become a drop-in replacement, allowing virtually effortless expansion of locale awareness. The CLDR allows me to trivially do the following:

  • Translate basic data such as month names, day of the week names and names of countries into many languages.
  • Format dates and times according to local conventions.
  • Perform character repertoire tests to guarantee that the fonts used include all the necessary characters to render a language fully.
  • Translate time zone names into local languages, because the CLDR uses the same time zone list as most POSIX systems, namely the one from the zoneinfo database.
  • Determine local currency and its symbol.
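To make the "query the XML directly" idea concrete, here is a minimal sketch that reads a month name from LDML and falls back from a regional locale to its base language. The XML below is a tiny hand-written stand-in for real CLDR files, and SAMPLE_LDML, fallback_chain and month_name are hypothetical names for this example. The truncation-based fallback is a simplification; real CLDR inheritance has additional rules.

```python
import xml.etree.ElementTree as ET

# Hand-written LDML-style fragments standing in for per-locale CLDR files.
# The pt_BR fragment deliberately defines no month names of its own.
SAMPLE_LDML = {
    "pt": """<ldml>
  <identity><language type="pt"/></identity>
  <dates><calendars><calendar type="gregorian">
    <months><monthContext type="format"><monthWidth type="wide">
      <month type="1">janeiro</month>
      <month type="2">fevereiro</month>
    </monthWidth></monthContext></months>
  </calendar></calendars></dates>
</ldml>""",
    "pt_BR": """<ldml>
  <identity><language type="pt"/><territory type="BR"/></identity>
</ldml>""",
}


def fallback_chain(locale):
    """'pt_BR' -> ['pt_BR', 'pt'] by dropping trailing subtags one at a time."""
    parts = locale.split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]


def month_name(locale, month):
    """Return the wide Gregorian month name, or None if no locale in the chain has it."""
    path = (
        "dates/calendars/calendar[@type='gregorian']/months/"
        "monthContext[@type='format']/monthWidth[@type='wide']/"
        f"month[@type='{month}']"
    )
    for candidate in fallback_chain(locale):
        xml_text = SAMPLE_LDML.get(candidate)
        if xml_text is None:
            continue  # no data for this locale, try its parent
        node = ET.fromstring(xml_text).find(path)
        if node is not None and node.text:
            return node.text
    return None


if __name__ == "__main__":
    print(month_name("pt_BR", 1))  # found in the "pt" fragment via fallback
```

The same chain works for the translation table itself: look up the full locale first, then the bare language.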

In order to serve a user's needs, I will need to get as many of the following details from them as possible, in this order:

  1. Locale: As a basic minimum, this will allow the representation of content in their local language and formatting of dates in their expected format.
  2. Location: As users may use locales for places that are not where they are located physically, a user may want to specify their country so that they can be shown appropriate location-based settings and defaults.
  3. Timezone: This cannot be reliably inferred from the above two pieces of information, although it can be guessed with reasonable accuracy. The user will however need the option of changing it, should they wish.

In order to allow full localization, those three pieces of information need to be determined about the user. Much of it can be inferred from IP-based geolocation data supplied by providers such as IP2Location and MaxMind. The ultimate goal is for a user to just hit the web site and immediately see their language, dates in their local format and times converted to their local time zone. Only time will tell how close I can get to that ideal.
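One way to combine explicit choices with inferred guesses is a simple precedence rule: whatever the user has set wins, then the guess, then a site default. The sketch below assumes that rule; the Settings class, the resolve function and the default values are hypothetical, and the "inferred" values would come from whatever geolocation lookup is in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    """The three details discussed above; None means 'not known'."""
    locale: Optional[str] = None
    country: Optional[str] = None
    timezone: Optional[str] = None


# Hypothetical site-wide defaults, used only when nothing else is known.
DEFAULTS = Settings(locale="en", country=None, timezone="UTC")


def resolve(explicit: Settings, inferred: Settings,
            defaults: Settings = DEFAULTS) -> Settings:
    """Pick each field from the first source that has a value for it."""
    def pick(field: str) -> Optional[str]:
        for source in (explicit, inferred, defaults):
            value = getattr(source, field)
            if value is not None:
                return value
        return None

    return Settings(pick("locale"), pick("country"), pick("timezone"))


if __name__ == "__main__":
    # A user located in Brazil who has explicitly chosen British English.
    guessed = Settings(locale="pt_BR", country="BR", timezone="America/Sao_Paulo")
    chosen = Settings(locale="en_GB")
    print(resolve(chosen, guessed))
```

Resolving each field independently lets a user override only the locale while keeping the guessed country and time zone.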

On a side note, I would like to give special mention to PostgreSQL's support for time zone and date math. Being able to perform all date / time related functions at the database makes it trivial to implement zone aware time handling. PostgreSQL is, in my opinion, the RDBMS of choice for applications requiring non-trivial date and time handling or other i18n functionality.

So that's that. I now have all the tools necessary to ensure that this project will end up with a final product that is fully location aware, and allows users to select their locale and have all the relevant alterations to output tailored to their regional expectations. I intend that this project end up being a glowing example of i18n done right. If you have anything to add, or know of anything I may have overlooked, then I invite you to drop me a comment below.

nVidia chipsets and Linux

9:16am 12th March 2006

I am at a loss as to why nVidia refuses to publish the specifications for its nForce 2200 Pro chipset. I can understand the need to keep the drivers for its graphics cards as closed-source, binary-only distributions, but its decision to take the same path with its chipsets mystifies me.

Motherboard chipsets, in order to be transparently available to the user, need to be integrated into the operating system. Distributing binaries makes the installation of I/O controllers, RAID cards and other onboard devices a pain for users who just want to have a system up and running as quickly as possible.

Furthermore, nVidia's chipsets are the best chipsets for the AMD64 platform, which is as popular with Linux servers as it is among gamers. If nVidia wants in on this market in a meaningful way, drivers will need to be incorporated into the Linux kernel, something that cannot happen unless usable specs are given to the maintainers of the relevant Linux modules. Jeff Garzik, maintainer of the kernel module libATA, has been quoted as saying that "Unfortunately, Nvidia is the only SATA hardware vendor that chooses not to give me any hardware information".

As it stands, having installed Linux on the new server, I am unable to use the nvRAID functionality of the board, which presumably would allow me to use the hot swap bays properly and allow for automatic volume rebuilds in the case of disk failure. Instead, I am using Linux's md system to provide software RAID functionality. It has proven to be a very high performance and reliable solution indeed, but I still feel that using even the partial RAID functionality provided by the nForce 2200 chipset would be preferable.


6:31am 3rd November 2005

Bloody hell. I have just spent the morning configuring the medical software at my parents' new medical clinic. After setting up MS Windows Server 2003 with SQL Server 2000, configuring all the workstations with the appropriate permissions and settings, and getting everything to what I thought was ready for the install technician to do her job, I found that things were not going to be smooth.

First off, PractiX, the ridiculously unpolished software package for managing medical centers, had requirements for the network environment that were so specific that it is hard to imagine integrating the software into an existing infrastructure that was not deployed specifically and exclusively for it. E.g., it can ONLY use the 192.168.1.* subnet, all users need to be in the workgroup "practix" and it requires MS SQL Server's Query Analyzer to be installed. What use that can have in a production deployment I don't know.

Secondly, the install technician had no idea what she was doing. She called me in claiming that the server's "ODBC settings weren't allowing the application to access the SQL server" and that the database files needed to be in the default location on the server. So I drove there to find that the actual problem was that she had not configured PractiX with the required server details. I mean what the HELL?! If I wasn't familiar with the PractiX package, how would that have been fixed? She had no idea, she just mindlessly blamed the installation of Windows and SQL Server. Then, this morning, she didn't know how to set up the printers, which again was a setting in the application she was sent out to install.

I really, really hate inept technicians. She did admit to not knowing much about "the computer side of things". Well then what the hell is she doing working with the computer side of things? As far as I can tell, all she *can* do is put a CD into the drive and click OK. Grr! This is a very frustrated Naz signing out.