UTF8 collation support for Croatian, Bosnian and Serbian (latin) in MariaDB/MySQL

Sunday, November 29th, 2009

Collation for the Croatian language has been a long-standing problem in MySQL: it was simply impossible. When the Yugoslav keyboard layout was designed, it was meant to cover the languages of all the republics. It covered all Slovenian characters (plus a couple of characters that Slovenian doesn’t have), but not all Croatian ones (it lacked ‘nj’, ‘lj’ and ‘dž’). When Yugoslavia fell apart, all the republics simply kept the already widespread Yugoslav layout. For Slovenian that was great, apart from the extra characters it didn’t need. For Croatian and Serbian Latin, well, not that great…

You see, we now type the letter ‘nj’ as a combination of ‘n’ and ‘j’. The same goes for ‘lj’ and ‘dž’. That wouldn’t be so bad if every word containing ‘l’ or ‘n’ followed by ‘j’ were pronounced with ‘lj’ or ‘nj’. For example, we have two words that are written the same (injekcija) but pronounced differently: in one we say ‘nj’, in the other ‘n’ and ‘j’ separately. That says a lot about how bad a choice it was to write that letter as ‘nj’. There are similar examples with ‘dž’.

As you can see, until we put ‘lj’ and ‘nj’ characters on the keyboard, we will never have fully correct sorting in any database. The good news is that those characters exist in Unicode, and that’s why we have the ‘hr unicode’ layout in Xorg. Too bad nobody uses it.
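To make the point about Unicode concrete, here is a small Python sketch (my own illustration, not part of the patch) showing that the digraph letters exist as single code points, and that compatibility normalization (NFKC) maps each of them back to the two-character spelling we normally type. The helper name describe is made up for this example.

```python
import unicodedata

# Lowercase single-code-point digraph letters: U+01C6, U+01C9, U+01CC.
DIGRAPH_CHARS = ("\u01c6", "\u01c9", "\u01cc")

def describe(ch):
    """Return (Unicode name, NFKC spelling) for a single character."""
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)

for ch in DIGRAPH_CHARS:
    name, spelled = describe(ch)
    # Each digraph letter is one code point; its NFKC form is two.
    print(f"U+{ord(ch):04X} {name}: {len(ch)} code point, "
          f"normally typed as {len(spelled)} ({spelled})")
```

Text typed on the usual layout contains the two-character spelling, so a database can't tell the letter ‘nj’ from ‘n’ followed by ‘j’ without extra rules.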

Until that’s sorted out, I’m happy to announce that MariaDB/MySQL just accepted a patch that makes it possible to define contractions containing non-ASCII characters, meaning that we can now have sorting rules for ‘dž’ (more about that at http://www.collation-charts.org/articles/croatian.htm). As a result, the utf8_croatian_ci and ucs2_croatian_ci collations were created and added to MariaDB 5.1 and MySQL 5.6. Since Alexander Barkov was kind enough to provide a patch for MySQL 5.1, I’ve created packages for Ubuntu. I’ve also modified that patch so that it works with MySQL 5.0. If you need this feature, add my PPA to your sources.list:

https://edge.launchpad.net/~ivoks/+archive/mysql-hr/

It’s important to realize that this patch makes a very intrusive change to the collation mechanism, so it’s not just a patch for Croatian collation. People from Bosnia and Herzegovina, Montenegro and Serbia (Latin) can also use this collation for their languages. It does not solve every problem (‘injekcija’ and ‘injekcija’, for example), but at least words starting with dž won’t be at the end of the sort :)
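For those who want to see what a contraction rule does, here is a rough Python sketch of the idea. It is only an illustration under my own assumptions, not the code from the patch: the names ALPHABET, RANK and croatian_key are made up for this example, and the key simply treats ‘dž’, ‘lj’ and ‘nj’ as single letters with their own place in the alphabet.

```python
import unicodedata

# Croatian Latin alphabet in dictionary order; digraphs count as one letter.
ALPHABET = ["a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h",
            "i", "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s",
            "š", "t", "u", "v", "z", "ž"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("dž", "lj", "nj")

def croatian_key(word):
    """Build a case-insensitive sort key, contracting digraphs into one weight."""
    s = unicodedata.normalize("NFC", word).lower()
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIGRAPHS:
            # Contraction: two typed characters get a single weight.
            key.append(RANK[pair])
            i += 2
        else:
            # Anything outside the alphabet sorts after all letters.
            key.append(RANK.get(s[i], len(ALPHABET) + ord(s[i])))
            i += 1
    return key

words = ["zebra", "džep", "dan", "đak", "ljeto", "luk", "njuška", "noć", "more"]
sorted_words = sorted(words, key=croatian_key)
print(sorted_words)
```

With this key, ‘džep’ lands between ‘dan’ and ‘đak’, ‘luk’ comes before ‘ljeto’, and ‘noć’ before ‘njuška’. Like the real collation, the sketch always reads ‘n’ followed by ‘j’ as the letter ‘nj’, so it can’t tell the two pronunciations of ‘injekcija’ apart either.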

Big thanks to the MariaDB community!