Few notes on locale craziness

Back in the EAPI 6 guide I shortly noted that we have added a sanitization requirement for locales. Having been informed of another locale issue in Python (pre-EAPI 6 ebuild), I have decided to write a short note of locale curiosities that could also serve in reporting issues upstream.

When l10n and i18n are concerned, most of the developers correctly predict that the date and time format, currencies, number formats are going to change. It’s rather hard to find an application that would fail because of changed system date format; however, much easier to find one that does not respect the locale and uses hard-coded format strings for user display. You can find applications that unconditionally use a specific decimal separator but it’s quite rare to find one that chokes itself combining code using hard-coded separator and system routines respecting locales. Some applications rely on English error messages but that’s rather obviously perceived as mistake. However, there are also two hard cases…

Lowercase and uppercase

For a start, if you thought that the ASCII range of lowercase characters would map clearly to the ASCII range of uppercase characters, you were wrong. The Turkish (tr_TR) locale is different here, and maps lowercase ‘i’ (LATIN SMALL LETTER I) into uppercase ‘İ’ (LATIN CAPITAL LETTER I WITH DOT ABOVE). Similarly, ‘I’ (LATIN CAPITAL LETTER I) maps to ‘ı’ (LATIN SMALL LETTER DOTLESS I). What does this mean in practice? That if you have a Turkish user, then depending on the software used, you Latin ‘i’ may be uppercased onto ‘I’ (as you expect it to be), ‘İ’ (as would be correct in free text) or… left as ‘i’.

What’s the solution for this? If you need to uppercase/lowercase an ASCII text (e.g. variable names), either use a function that does not respect locale (e.g. 'i' - ('a' - 'A') in C) or set LC_CTYPE to a sane locale (e.g. C). However, remember that LC_CTYPE affects the character encoding — i.e. if you read UTF-8, you need to use a locale with UTF-8 codeset.

Collation

The other problem is collation, i.e. sorting. The more obvious part of it is that the particular locales enforce specific sorting of their specific diacritic characters. For example, the Polish letter ‘ą’ would be sorted between ‘a’ and ‘b’ in the Polish locale, and somewhere at the end in the C locale. The intermediately obvious part of it is that some locales have different ordering of lowercase and uppercase characters — the C and German locales sort uppercase characters first (the former because of ASCII codes), while many other locales sort the opposite.

Now, the non-obvious part is that some locales actually reorder the Latin alphabet. For example, the Estonian (et_EE) locale puts ‘z’ somewhere between ‘s’ and ‘t’. Yep, seriously. What’s even less obvious is that it means that the [a-z] character class suddenly ends halfway through the lowercase characters!

What’s the solution? Again, either use non-locale-sensitive functions or sanitize LC_COLLATE. For regular expressions, the named character classes ([[:lower:]], [[:upper:]]) are always a better choice.

Does anyone know more fun locales?

One thought on “Few notes on locale craziness”

  1. Minor problem with the “C locales cover Turkish” thing. I’m genuinely not sure if there’s a requirement that toupper/tolower behaves consistently with towupper/towlower, and exactly how to handle such an inconsistency correctly if it’s even allowed. toupper(‘i’) in a Turkish locale has no obvious correct return value: the spec says “If the argument is a character for which islower is true and there are one or more corresponding characters, as specified by the current locale, for which isupper is true, the toupper function returns one of the corresponding characters (always the same one for any given locale); otherwise, the argument is returned unchanged.” However, while there is such a character, it is not representable as a single char. So it should arguably return ‘i’, but then towupper(L’i’) still needs to return L’İ’, and that inconsistent behavior is either illegal or very, very weird. Fundamentally confusing case that the C locale system, at best, does not handle well.

Leave a Reply

Your email address will not be published.