Lakes Entrance road trip and Optus Wireless Broadband

10:30am 26th January 2008

Well it's the time of year when road trips are the in thing, so some friends and I, as well as my brother and his girlfriend, jumped into a van and a car and headed over to Lakes Entrance. It's a 4 hour drive, but thanks to a bunch of unnecessary stops, including one at an SES-managed stop-off for drowsy drivers where they serve free tea, coffee and Milo, we didn't arrive until 2am. The owner of the flats where we are staying was waiting up for us.

I have access to the Internet via Optus Wireless Broadband. I've tried the service a few times before, but it proved to be too much of a hassle to be worth it. The new USB modems, however, make it a snap. The software and drivers are all on the modem itself, which presents as a standard USB mass storage device. It's literally plug it in, wait a bit, and you're connected.

Anyway, I'm hoping to be back in Melbourne on Tuesday. Hopefully tomorrow we'll catch some fish when we visit the local jetty, it'd be nice to have some lemon baked flathead that we caught ourselves. Here's to hoping!

Fun with internationalization, ISO 3166-1, ISO 639 and the CLDR

3:13pm 8th January 2008

So I came to a point in a software project that I am working on where it is time to consider how foreign users will be able to use the system. Yep, it's time for i18n and L10n. Initially, I thought that this would be as easy as simply setting up a translation matrix, and wrapping all static data in a function that performs a lookup on it and outputs the appropriate language.

This worked fine. I set up the matrix using a list of languages maintained in the database according to the ISO 639-1 standard. I wrote a simple yet effective tool for maintaining translation tables and began using online translation services to translate the simplest words and phrases, performing reverse translations on them to ensure they were not contextually mutated by the translator. Most words could be done this way, and in a very short period of time I was able to have substantial portions of the site in other languages, including non-Latin-script languages such as Russian, Greek and Thai. All was well in the world.

Or so I thought. Then it all hit me at once: a single thought that brought all my cleverness crashing down and threw up a whole new set of problems to overcome. It was like the opposite of an epiphany. Here's the situation: if I have a list of languages from ISO 639, how do I handle the fact that the same language may actually need two translations? E.g., how would I deal with translating the word "colour" for the American spelling? The ISO code 'en' does not allow for more than one variety of English. Then there's Portuguese, which has two regional variants: that spoken in Portugal and that spoken in Brazil. Finally, Chinese: ISO 639 specifies only a single language, 'zh', yet there are two written forms, traditional and simplified, which are practically different languages.

It was clear that my solution was just not sufficient. Back to the drawing board.
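To make the ambiguity concrete, here's a minimal sketch in Python of keying translations by locale (language plus ISO 3166-1 region) with a fallback to the bare language code. The keys and strings are invented for illustration:

```python
# Keying only by ISO 639-1 language code cannot distinguish regional
# variants: 'en' can hold "Colour" or "Color", but not both. Keying
# by locale, with a fallback to the bare language, resolves this.
locale_translations = {
    "en":    {"colour_label": "Colour"},   # default / fallback English
    "en-US": {"colour_label": "Color"},    # American spelling
    "pt-PT": {"hello": "Olá"},
    "pt-BR": {"hello": "Olá"},
}

def translate(key, locale):
    """Look up key for a locale, falling back to the bare language."""
    language = locale.split("-")[0]
    for table in (locale_translations.get(locale, {}),
                  locale_translations.get(language, {})):
        if key in table:
            return table[key]
    return key  # last resort: show the untranslated key

print(translate("colour_label", "en-US"))  # Color
print(translate("colour_label", "en-GB"))  # no en-GB entry: Colour
```

The fallback means a locale like 'en-GB', which has no entries of its own, still resolves to the generic English translation.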

After pondering this problem for some time, I decided that if I was going to do i18n at all, I was going to do it properly. After all, I've been uncompromising in my support for and awareness of time zone handling. Additionally, I had been careful from the word go about using UTF-8 aware functionality in the code, the database and all ancillary systems. It'd be silly to compromise on proper i18n at this point. That meant using appropriate date formats, spelling variants, weekday/weekend definitions, everything. But how? I had nowhere near the resources to engage in the collation and maintenance of that information. Nonetheless, I was determined that it be done right.

I was given salvation by another IRCer who is also working with this at the moment. He told me about the Common Locale Data Repository, or CLDR for short. It is a large, freely available repository of locale data stored and distributed in a widely used XML format called LDML (Locale Data Markup Language). This format is used in many projects for the exchange of locale data, including Microsoft's .NET framework. A relatively new project, having only started in 2003, the CLDR is already by far the most comprehensive freely available repository of locale data. It is maintained by the Unicode Consortium, so one can be certain that things like format stability, backwards compatibility and data consistency will be given due attention.

Implementing full i18n/L10n in my project will be fairly involved, but not difficult. Locale identifiers are made up of ISO 639 language codes and ISO 3166-1 alpha-2 country codes, for both of which I already have authoritative sources. Combining them into valid locales is a trivial job. Now, instead of my translation table being keyed on one axis by language, it is keyed by locale with a default "fallback" language, meaning very little change is required to the infrastructure already written: it supports everything from trivial one- or two-word spelling variants through to wholly different languages with different character sets. Furthermore, the CLDR data does not even need to be imported into the database before being used; the XML files can be stored locally and queried directly, so future versions become a drop-in replacement, allowing virtually effortless expansion of locale awareness. The CLDR allows me to trivially do the following:

  • Translate basic data such as month names, day of the week names and names of countries into many languages.
  • Format dates and times according to local conventions.
  • Perform character repertoire tests to guarantee that the fonts used include all the necessary characters to render a language fully.
  • Translate time zone names into local languages, because the CLDR uses the same time zone list as most POSIX systems: zone.tab, which is part of the zoneinfo database.
  • Determine local currency and its symbol.
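Because LDML is plain XML, querying it directly is straightforward. Here's a rough sketch in Python using the standard library's ElementTree; the LDML fragment below is hand-abbreviated for illustration (a real file such as common/main/fr.xml contains far more):

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written fragment in the shape of a CLDR LDML file;
# in practice you would parse the locale's XML file from disk instead.
ldml = ET.fromstring("""
<ldml>
  <dates><calendars><calendar type="gregorian">
    <months><monthContext type="format"><monthWidth type="wide">
      <month type="1">janvier</month>
      <month type="2">février</month>
      <month type="3">mars</month>
    </monthWidth></monthContext></months>
  </calendar></calendars></dates>
</ldml>
""")

# Pull out the wide (full-name) month translations, keyed by month number.
months = {
    int(m.get("type")): m.text
    for m in ldml.findall(".//monthWidth[@type='wide']/month")
}
print(months[1])  # janvier
```

Swapping in a newer copy of the XML file immediately updates every lookup, which is what makes new CLDR releases a drop-in replacement.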

In order to serve a user's needs, I will need to get as many of the following details from them as possible, in this order:

  1. Locale: As a basic minimum, this will allow the representation of content in their local language and formatting of dates in their expected format.
  2. Location: As users may use locales for places that are not where they are physically located, a user may want to specify their country so that they can be shown appropriate location-based settings and defaults.
  3. Time zone: This cannot be reliably inferred from the above two pieces of information, although it can be guessed reasonably accurately. The user will, however, need the option of changing it should they wish.
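The time zone guess in point 3 could be sketched like this, using a zone.tab-style mapping from country code to candidate zones. The table here is abbreviated and hand-written; the real zone.tab ships with the zoneinfo database:

```python
# Abbreviated zone.tab-style data: ISO 3166-1 alpha-2 country code
# mapped to candidate IANA time zone names, most populous first.
ZONE_TAB = {
    "AU": ["Australia/Sydney", "Australia/Melbourne", "Australia/Perth"],
    "PT": ["Europe/Lisbon", "Atlantic/Madeira", "Atlantic/Azores"],
    "BR": ["America/Sao_Paulo", "America/Manaus"],
}

def guess_timezone(country_code):
    """Guess a default zone for a country; the user can override it."""
    zones = ZONE_TAB.get(country_code.upper(), [])
    return zones[0] if zones else "UTC"  # crude but serviceable default

print(guess_timezone("AU"))  # Australia/Sydney
```

A real implementation would narrow the guess further using a geolocated city, but the principle is the same: offer a sensible default and let the user correct it.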

In order to allow full localization, those three pieces of information need to be determined about the user. Much of it can be inferred using IP-based geolocation data providers such as IP2Location and MaxMind. The ultimate goal is for a user to just hit the web site and immediately see their language, dates in their local format and times converted to their local time zone. Only time will tell how close I can get to that ideal.

On a side note, I would like to give special mention to PostgreSQL's support for time zone and date math. Being able to perform all date/time-related functions in the database makes it trivial to implement zone-aware time handling. PostgreSQL is, in my opinion, the RDBMS of choice for applications requiring non-trivial date and time handling or other i18n functionality.
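The same store-in-UTC, convert-on-display pattern works application-side too. As a sketch in Python (using the standard library's zoneinfo module), this is roughly what PostgreSQL's AT TIME ZONE operator does for you in SQL:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Store timestamps in UTC; convert to the user's zone only for display,
# much as PostgreSQL's AT TIME ZONE operator does inside a query.
utc_time = datetime(2008, 1, 8, 4, 13, tzinfo=ZoneInfo("UTC"))
local = utc_time.astimezone(ZoneInfo("Australia/Melbourne"))
print(local.isoformat())  # 2008-01-08T15:13:00+11:00 (AEDT, UTC+11)
```

Doing the conversion in the database keeps every query consistent, but having the same logic available in the application layer is handy for formatting at the template level.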

So that's that. I now have all the tools necessary to ensure that this project will end up with a final product that is fully location aware, and allows users to select their locale and have all the relevant alterations to output tailored to their regional expectations. I intend that this project end up being a glowing example of i18n done right. If you have anything to add, or know of anything I may have overlooked, then I invite you to drop me a comment below.