Back in the EAPI 6 guide I briefly noted that we have added a sanitization requirement for locales. Having been informed of another locale issue in Python (pre-EAPI 6 ebuild), I have decided to write a short note on locale curiosities that could also serve as a reference when reporting issues upstream.
When l10n and i18n are concerned, most developers correctly predict that date and time formats, currencies and number formats are going to change. It’s rather hard to find an application that fails because of a changed system date format; it is much easier to find one that does not respect the locale and uses hard-coded format strings for user display. You can find applications that unconditionally use a specific decimal separator, but it’s quite rare to find one that chokes by combining code that uses a hard-coded separator with system routines that respect the locale. Some applications rely on English error messages, but that is rather obviously perceived as a mistake. However, there are also two hard cases…
Lowercase and uppercase
For a start, if you thought that the ASCII range of lowercase characters would map cleanly to the ASCII range of uppercase characters, you were wrong. The Turkish (tr_TR) locale is different here: it maps lowercase ‘i’ (LATIN SMALL LETTER I) to uppercase ‘İ’ (LATIN CAPITAL LETTER I WITH DOT ABOVE). Similarly, uppercase ‘I’ (LATIN CAPITAL LETTER I) maps to lowercase ‘ı’ (LATIN SMALL LETTER DOTLESS I). What does this mean in practice? If you have a Turkish user, then depending on the software used, your Latin ‘i’ may be uppercased to ‘I’ (as you expect it to be), to ‘İ’ (as would be correct in free text) or… left as ‘i’.
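To make the two mappings concrete, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` follow the locale-independent Unicode default mappings, so the Turkish behaviour has to be modelled by hand; `turkish_upper` and `turkish_lower` are hypothetical helpers written for this illustration, not part of any library.

```python
def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish pairs i -> İ and ı -> I."""
    # Handle the two Turkish-specific letters first, then fall back to
    # the default Unicode mapping for everything else.
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish pairs I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


if __name__ == "__main__":
    word = "title"
    print(word.upper())           # default mapping: TITLE
    print(turkish_upper(word))    # Turkish mapping: TİTLE
    print(turkish_lower("TITLE")) # Turkish mapping: tıtle
```

Code that compares an uppercased identifier against a fixed ASCII string gets the first result in most locales and may get the second one under tr_TR, which is where the surprises come from.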
What’s the solution for this? If you need to uppercase or lowercase ASCII text (e.g. variable names), either use a function that does not respect the locale (e.g. 'i' - ('a' - 'A') in C) or set LC_CTYPE to a sane locale (e.g. C). However, remember that LC_CTYPE also affects the character encoding, i.e. if you read UTF-8, you need to use a locale with a UTF-8 codeset.
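The first option, a conversion that never consults the locale, could look like the following Python sketch. The names `ascii_upper` and `ascii_lower` are made up for this example; the idea is the same as the C arithmetic above, restricted to the 26 ASCII letters so that everything else passes through untouched.

```python
import string

# Translation tables that cover only the 26 ASCII letters.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_upper(text: str) -> str:
    """Uppercase ASCII letters only; leave every other character as-is."""
    return text.translate(_ASCII_UPPER)


def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters only; leave every other character as-is."""
    return text.translate(_ASCII_LOWER)


if __name__ == "__main__":
    print(ascii_upper("install_prefix"))  # INSTALL_PREFIX
    print(ascii_upper("straße"))          # STRAßE (non-ASCII left alone)
    # The same offset trick as the C expression 'i' - ('a' - 'A'):
    print(chr(ord("i") - (ord("a") - ord("A"))))  # I
```

This is suitable for identifiers and other machine-readable text; for free text shown to the user, a locale-aware conversion is still the right tool.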
Collation
The other problem is collation, i.e. sorting. The more obvious part is that particular locales enforce a specific ordering of their own diacritic characters. For example, the Polish letter ‘ą’ is sorted between ‘a’ and ‘b’ in the Polish locale, and somewhere at the end in the C locale. The somewhat less obvious part is that some locales order lowercase and uppercase characters differently: the C and German locales sort uppercase characters first (the former because of ASCII codes), while many other locales do the opposite.
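The difference between code point order and locale collation can be observed with a short Python sketch. It assumes a locale named `pl_PL.UTF-8` may or may not be installed, so the hypothetical helper `sort_in_locale` returns `None` when the requested locale is unavailable; the result of the locale-aware sort depends on the system's locale data and is therefore not shown.

```python
import locale

WORDS = ["b", "ą", "a", "B"]


def sort_in_locale(words, name):
    """Sort words by the collation rules of locale `name`.

    Returns None if the locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that sorts per LC_COLLATE.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting


if __name__ == "__main__":
    # Plain sorted() compares code points, like the C locale does.
    print(sorted(WORDS))  # ['B', 'a', 'b', 'ą']
    # Locale-aware order; the output depends on the installed locale data.
    print(sort_in_locale(WORDS, "pl_PL.UTF-8"))
```

Note that `locale.setlocale()` changes process-wide state, so a helper like this is not safe to call from multiple threads at once.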
Now, the non-obvious part is that some locales actually reorder the Latin alphabet. For example, the Estonian (et_EE) locale puts ‘z’ somewhere between ‘s’ and ‘t’. Yep, seriously. What’s even less obvious is that it means that the [a-z] character class suddenly ends halfway through the lowercase characters!
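To see why a reordered alphabet truncates a range, consider the toy model below. It is only an illustration: `TOY_ORDER` is an assumed, simplified collation order (plain ASCII lowercase with ‘z’ moved between ‘s’ and ‘t’), and `letters_in_range` is a hypothetical helper that treats a range as "everything that collates between the two endpoints". Real regex engines and C libraries implement this differently and not always consistently.

```python
import string

# Assumed, simplified order: ASCII lowercase with 'z' placed after 's'.
TOY_ORDER = "abcdefghijklmnopqrsztuvwxy"


def letters_in_range(first, last, order):
    """Return the ASCII lowercase letters that collate between first and last."""
    lo, hi = order.index(first), order.index(last)
    return [c for c in string.ascii_lowercase if lo <= order.index(c) <= hi]


if __name__ == "__main__":
    # With the usual order, a-z covers the whole alphabet.
    print("".join(letters_in_range("a", "z", string.ascii_lowercase)))
    # abcdefghijklmnopqrstuvwxyz

    # With 'z' moved up, the range stops right after 's'.
    print("".join(letters_in_range("a", "z", TOY_ORDER)))
    # abcdefghijklmnopqrsz
```

In this model the letters ‘t’ through ‘y’ collate after ‘z’, so they fall outside the range even though they are ordinary lowercase letters.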
What’s the solution? Again, either use non-locale-sensitive functions or sanitize LC_COLLATE. For regular expressions, the named character classes ([[:lower:]], [[:upper:]]) are always a better choice.
Does anyone know more fun locales?