We recently had an interesting conversation in the office about language and dialect codes. If we store translations in translation memory, how should we handle 'en' (English) versus 'en-US' (English in the USA)? Should we assume that they are the same, or should we handle them as two entirely different languages? The problem is not quite as simple as one would have hoped.
Many translators in the world of Free Software will be familiar with language codes such as 'af' (Afrikaans), 'pt_BR' (Portuguese in Brazil), and maybe even something like 'sr@Latn' (Serbian written with Latin script). The last of these is a format that is seen in glibc (and therefore in the Linux world) and is used in cases where a single language is written with more than one alphabet or writing system. There are even standards for these language codes.
These standards should leave room for just about anything, including private extensions. Most applications will probably never see a complete language code with seven parts, and such long codes are strongly discouraged anyway. Our conversation specifically circled around how to save language specifiers, and how to use them later during read operations, given that matches are expected between simple and extended language codes. Specifically, we had a store for translation memory in mind. A user who is looking for translation memory matches for Afrikaans ('af') almost certainly also wants results stored under 'af-ZA' (Afrikaans in South Africa).
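Before any matching can happen, codes such as 'pt_BR' and 'sr@Latn' have to be brought into one spelling. The following is only a rough sketch of that step: the function name `normalize` and the small modifier-to-script table are assumptions made for illustration, and the table is far from complete. It applies the usual casing convention (lowercase language, title-case script, uppercase region).

```python
# Assumed, incomplete mapping from glibc-style modifiers to script subtags.
_SCRIPTS = {"latn": "Latn", "latin": "Latn", "cyrl": "Cyrl", "cyrillic": "Cyrl"}


def _is_script(subtag):
    """A script subtag is four letters, e.g. 'Latn' or 'Hans'."""
    return len(subtag) == 4 and subtag.isalpha()


def normalize(code):
    """Turn 'pt_BR', 'sr@Latn' or 'sr_RS.UTF-8@latin' into a hyphenated tag."""
    base, _, modifier = code.strip().partition("@")
    base = base.split(".", 1)[0]              # drop a codeset such as '.UTF-8'
    subtags = base.replace("_", "-").split("-")

    out = [subtags[0].lower()]                # language subtag in lowercase
    for sub in subtags[1:]:
        if _is_script(sub):
            out.append(sub.title())           # script: 'hans' -> 'Hans'
        elif len(sub) == 2 and sub.isalpha():
            out.append(sub.upper())           # region: 'br' -> 'BR'
        else:
            out.append(sub.lower())

    # A recognised modifier becomes a script subtag right after the language,
    # unless the code already carries a script.
    script = _SCRIPTS.get(modifier.lower())
    if script and not any(_is_script(s) for s in subtags[1:]):
        out.insert(1, script)
    return "-".join(out)
```

With this, `normalize('pt_BR')` gives 'pt-BR' and `normalize('sr@Latn')` gives 'sr-Latn', so both can be stored and compared in the same form as 'af-ZA'.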
RFC 4647 explains how different matches should work within certain search operations. It covers lots of cases, and makes good sense. I guess we won't ever encounter language codes such as '*' or 'fr-*' in our applications. Therefore we might be able to sidestep this piece of complexity.
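The simplest scheme in RFC 4647, basic filtering, can be sketched in a few lines: a range matches a tag if the two are equal ignoring case, or if the range is a prefix of the tag and the next character is a hyphen. The function name `basic_filter` is made up for this sketch. The '*' range is included because it is cheap to support, whereas extended ranges such as 'fr-*' are deliberately left out.

```python
def basic_filter(language_range, tags):
    """Return the tags matched by a basic language range (RFC 4647 style)."""
    rng = language_range.lower()
    if rng == "*":
        return list(tags)                     # the wildcard matches every tag
    return [
        tag for tag in tags
        if tag.lower() == rng or tag.lower().startswith(rng + "-")
    ]
```

A query for 'af' therefore picks up both 'af' and 'af-ZA', while a query for 'af-ZA' does not pick up plain 'af'. The hyphen check also keeps 'af' from accidentally matching an unrelated code such as 'afa'.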
But what do we do in the case where the data wasn't stored 100% correctly? We specifically discussed the case of Arabic. The software of a translator in Morocco might suggest 'ar-MA' as the working language, and the translator might just accept it without thinking, so his translations will be saved as such in his files. In reality there isn't supposed to be any difference between the Arabic of the different countries, so this label is not ideal for optimal reuse. Somebody doing a search for 'ar' (generic Arabic) can get these results according to RFC 4647. But the software of an Arabic translator in Egypt could automatically suggest 'ar-EG' (Arabic in Egypt) as the working language. A query for suggestions in 'ar-EG' will not give results saved under 'ar-MA', even though this is realistically what an Arabic translator would want in just about all cases.
Simplification looks attractive, but won't work in the case of Chinese, for example: 'zh-CN' (Chinese of China) simplified to 'zh' (generic Chinese) means nothing. There is no real localisation being done for 'zh', only for 'zh-CN'/'zh-TW' or 'zh-Hans'/'zh-Hant' (where the distinction is made on script rather than country). Somebody who wants suggestions from all types of Chinese would have to specify it in the same way as an Afrikaans translator might show an interest in Dutch.
The challenge will therefore be to make it as easy as possible at the application level for the user to make the right choice, without simplifying things automatically. The possibility of language-specific rules looks attractive, but it would be nice if they weren't necessary. I guess we will store things with detail, but try in the user interface to simplify when the query language is determined.
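To make that idea concrete, here is a minimal sketch of "store with detail, simplify the query". Everything in it is hypothetical: the names `KEEP_DETAIL`, `suggest_query_language` and `find_matches`, the dictionary used as a stand-in for the translation memory store, and above all the one-entry rule table, which is not a vetted list of languages.

```python
# Assumed rule table: languages whose extended codes should NOT be simplified.
# Only Chinese is listed here, following the example in the text.
KEEP_DETAIL = {"zh"}


def suggest_query_language(working_language, keep_detail=KEEP_DETAIL):
    """Suggest the code to query with; the user can still override it."""
    primary = working_language.split("-", 1)[0].lower()
    if primary in keep_detail:
        return working_language               # e.g. 'zh-CN' stays as it is
    return primary                            # e.g. 'ar-EG' becomes 'ar'


def find_matches(store, working_language):
    """Collect units whose stored tag equals the query or extends it."""
    query = suggest_query_language(working_language).lower()
    results = []
    for tag, units in store.items():          # store: {tag: [units, ...]}
        stored = tag.lower()
        if stored == query or stored.startswith(query + "-"):
            results.extend(units)
    return results
```

With a store holding entries under 'ar-MA' and 'ar', a translator working in 'ar-EG' is offered both, because the query is simplified to 'ar' while the stored codes keep their detail. A translator working in 'zh-CN' only sees 'zh-CN' entries and would have to ask for 'zh-TW' explicitly.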
Comments
language-specific rules probably inevitable
I think you will end up needing to have language-specific rules, with some reasonable defaults for cases not handled specially (probably, to ignore country-code components if not otherwise specified). There was some discussion of this issue on the open-tran.eu blog last spring: http://open-tran.blogspot.com/2008/05/languages-and-cultures.html
@alex
Re: language-specific rules probably inevitable
Of course, we are trying to do the right thing, and not just for free software translations. If we transform the language codes, when and how do we do that? Before storing, or only when querying? We are currently going with a possible simplification on insert: we simplify the code unless we know that the active (extended) locale is normally handled separately from the one with the simplified language code. This just means that we need to be very sure that our list of languages is fairly complete.