{"id":543,"date":"2016-09-22T22:13:31","date_gmt":"2016-09-22T20:13:31","guid":{"rendered":"https:\/\/blogs.gentoo.org\/mgorny\/?p=543"},"modified":"2016-09-22T23:07:35","modified_gmt":"2016-09-22T21:07:35","slug":"few-notes-on-locale-craziness","status":"publish","type":"post","link":"https:\/\/blogs.gentoo.org\/mgorny\/2016\/09\/22\/few-notes-on-locale-craziness\/","title":{"rendered":"Few notes on locale craziness"},"content":{"rendered":"<p>Back in the\u00a0<a href=\"https:\/\/blogs.gentoo.org\/mgorny\/2015\/11\/13\/the-ultimate-guide-to-eapi-6\/\">EAPI 6 guide<\/a> I briefly noted that we have added a\u00a0sanitization requirement for locales. Having been informed of <a rel=\"external\" href=\"https:\/\/bugs.gentoo.org\/594768\">another locale issue in Python<\/a> (pre-EAPI 6 ebuild), I have decided to write a\u00a0short note on locale curiosities that could also serve in\u00a0reporting issues upstream.<\/p>\n<p>When l10n and\u00a0i18n are concerned, most developers correctly predict that the\u00a0date and\u00a0time formats, currencies and\u00a0number formats are going to change. It&#8217;s rather hard to find an\u00a0application that would fail because of a\u00a0changed system date format; however, it is much easier to find one that does not respect the\u00a0locale and\u00a0uses hard-coded format strings for user display. You can find applications that unconditionally use a\u00a0specific decimal separator, but it&#8217;s quite rare to find one that chokes because it combines code using a\u00a0hard-coded separator with system routines that respect the\u00a0locale. Some applications rely on English error messages, but that&#8217;s rather obviously perceived as a\u00a0mistake. However, there are also two hard cases\u2026<\/p>\n<p><!--more--><\/p>\n<h2>Lowercase and\u00a0uppercase<\/h2>\n<p>For a\u00a0start, if you thought that the\u00a0ASCII range of\u00a0lowercase characters would map cleanly to the\u00a0ASCII range of\u00a0uppercase characters, you were wrong. 
The\u00a0Turkish (<kbd>tr_TR<\/kbd>) locale is different here, and\u00a0maps lowercase \u2018i\u2019 (LATIN SMALL LETTER I) to uppercase \u2018\u0130\u2019 (LATIN CAPITAL LETTER I WITH DOT ABOVE). Similarly, uppercase \u2018I\u2019 (LATIN CAPITAL LETTER I) maps to lowercase \u2018\u0131\u2019 (LATIN SMALL LETTER DOTLESS I). What does this mean in\u00a0practice? That if you have a\u00a0Turkish user, then depending on the\u00a0software used, your Latin \u2018i\u2019 may be uppercased to \u2018I\u2019 (as you expect it to be), to \u2018\u0130\u2019 (as would be correct in\u00a0free text) or\u2026 left as \u2018i\u2019.<\/p>\n<p>What&#8217;s the\u00a0solution for this? If you need to uppercase\/lowercase ASCII text (e.g. variable names), either use a\u00a0function that does not respect the\u00a0locale (e.g. <code>'i' - ('a' - 'A')<\/code> in\u00a0C) or set <kbd>LC_CTYPE<\/kbd> to a\u00a0sane locale (e.g. <code>C<\/code>). However, remember that <kbd>LC_CTYPE<\/kbd> also affects the\u00a0character encoding; that is, if you read UTF-8, you need to use a\u00a0locale with a\u00a0UTF-8 codeset.<\/p>\n<h2>Collation<\/h2>\n<p>The\u00a0other problem is collation, i.e. sorting. The\u00a0more obvious part of it is that particular locales enforce a\u00a0specific ordering of\u00a0their diacritic characters. For example, the\u00a0Polish letter \u2018\u0105\u2019 would be sorted between \u2018a\u2019 and\u00a0\u2018b\u2019 in\u00a0the\u00a0Polish locale, and\u00a0somewhere at\u00a0the\u00a0end in\u00a0the\u00a0C locale. The\u00a0moderately obvious part of it is that some locales order lowercase and\u00a0uppercase characters differently: the\u00a0C and\u00a0German locales sort uppercase characters first (the\u00a0former because of ASCII codes), while many other locales sort the\u00a0opposite way.<\/p>\n<p>Now, the\u00a0non-obvious part is that some locales actually reorder the\u00a0Latin alphabet. 
For example, the\u00a0Estonian (<kbd>et_EE<\/kbd>) locale puts \u2018z\u2019 somewhere between \u2018s\u2019 and\u00a0\u2018t\u2019. Yep, seriously. What&#8217;s even less obvious is the\u00a0consequence: the\u00a0<kbd>[a-z]<\/kbd> character class suddenly ends halfway through the\u00a0lowercase characters!<\/p>\n<p>What&#8217;s the\u00a0solution? Again, either use non-locale-sensitive functions or\u00a0sanitize <kbd>LC_COLLATE<\/kbd>. For\u00a0regular expressions, the\u00a0named character classes (<kbd>[[:lower:]]<\/kbd>, <kbd>[[:upper:]]<\/kbd>) are always a\u00a0better choice.<\/p>\n<p>Does anyone know of more fun locales?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Back in the\u00a0EAPI 6 guide I briefly noted that we have added a\u00a0sanitization requirement for locales. Having been informed of another locale issue in Python (pre-EAPI 6 ebuild), I have decided to write a\u00a0short note on locale curiosities that could also serve in\u00a0reporting issues upstream. When l10n and\u00a0i18n are concerned, most developers correctly predict &hellip; <a href=\"https:\/\/blogs.gentoo.org\/mgorny\/2016\/09\/22\/few-notes-on-locale-craziness\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Few notes on locale 
craziness&#8221;<\/span><\/a><\/p>\n","protected":false},"author":137,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[8],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/543"}],"collection":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/comments?post=543"}],"version-history":[{"count":5,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/543\/revisions"}],"predecessor-version":[{"id":549,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/543\/revisions\/549"}],"wp:attachment":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/media?parent=543"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/categories?post=543"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/tags?post=543"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}