UTF8 collation support for Croatian, Bosnian and Serbian (latin) in MariaDB/MySQL

Collation for the Croatian language has been a long-standing problem in MySQL: correct sorting was impossible. When the Yugoslav keyboard layout was designed, it was meant to cover the languages of all the republics. It covered all Slovenian characters (plus a couple of characters that Slovenian doesn’t have), but not all Croatian ones (it missed ‘nj’, ‘lj’ and ‘dž’). When Yugoslavia fell apart, all the republics simply kept the already widespread Yugoslav layout. For Slovenian that was great, apart from the extra characters it doesn’t need. For Croatian and Serbian Latin, well, not that great…

You see, we now type the letter ‘nj’ as a combination of ‘n’ and ‘j’. The same goes for ‘lj’ and ‘dž’. That wouldn’t be so bad if every word containing ‘l’ or ‘n’ followed by ‘j’ were pronounced with ‘lj’ or ‘nj’. But, for example, we have two words that are written the same (injekcija) and pronounced differently: in one case we say it with ‘nj’, and in the other with separate ‘n’ and ‘j’. So much for the choice of writing that letter as ‘nj’. There are similar examples for ‘dž’.

As you can see, until we put the ‘lj’ and ‘nj’ characters on the keyboard, we will never have correct sorting in any database. The good news is that those characters exist in Unicode, and that’s why we have the ‘hr unicode’ layout in Xorg. Too bad nobody uses it.
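To make the Unicode point concrete, here is a small Python sketch that uses only the standard unicodedata module. It prints the single-code-point digraphs and shows that compatibility normalization (NFKC) expands them back into the two-character spelling everyone actually types. The helper name to_typed_form is made up for this illustration.

```python
import unicodedata

# Single-code-point lowercase digraphs that exist in Unicode: dž, lj, nj.
DIGRAPHS = ["\u01c6", "\u01c9", "\u01cc"]


def to_typed_form(text):
    """Hypothetical helper: expand digraph code points into the
    two-letter spelling produced by the common keyboard layout."""
    return unicodedata.normalize("NFKC", text)


for ch in DIGRAPHS:
    typed = to_typed_form(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {typed!r} "
          f"({len(typed)} code points)")
```

Once the text has been typed (or normalized) as two separate characters, the information about which letter was meant is gone, and that is exactly what a collation has to guess.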

Until that’s sorted out, I’m happy to announce that MariaDB/MySQL just accepted a patch that makes contractions containing non-ASCII characters possible, meaning that we can now have sorting rules for ‘dž’ (more about that at http://www.collation-charts.org/articles/croatian.htm). As a result, the utf8_croatian_ci and ucs2_croatian_ci collations were created and added to MariaDB 5.1 and MySQL 5.6. Since Alexander Barkov was kind enough to provide a patch for MySQL 5.1, I’ve created packages for Ubuntu. I’ve also modified that patch so that it works with MySQL 5.0. If you need this feature, add my PPA to your sources.list:

https://edge.launchpad.net/~ivoks/+archive/mysql-hr/

It’s important to realize that this patch makes a very intrusive change to the collation mechanism, so it’s not just a patch for Croatian collation. People from Bosnia and Herzegovina, Montenegro and Serbia (Latin) can also use this collation for their languages. It does not solve every problem (it cannot tell ‘injekcija’ with ‘nj’ from ‘injekcija’ with separate ‘n’ and ‘j’, for example), but at least words starting with dž won’t be at the end of the sort :)
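For readers who want to see what a contraction does to sort order, here is a rough Python sketch. It is not the MariaDB/MySQL implementation, just an illustration of the idea: the two-character sequences ‘dž’, ‘lj’ and ‘nj’ get their own weight in the Croatian alphabet instead of being compared character by character. The names ALPHABET, WEIGHT and croatian_key, and the sample words, are made up for this example.

```python
# Croatian alphabet in order; the digraphs count as single letters.
ALPHABET = ["a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h",
            "i", "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s",
            "š", "t", "u", "v", "z", "ž"]
WEIGHT = {letter: i for i, letter in enumerate(ALPHABET)}


def croatian_key(word):
    """Illustrative sort key: match two-character contractions greedily,
    then fall back to single letters."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if len(pair) == 2 and pair in WEIGHT:
            key.append(WEIGHT[pair])   # contraction: one weight for two chars
            i += 2
        elif s[i] in WEIGHT:
            key.append(WEIGHT[s[i]])
            i += 1
        else:
            # Anything outside the alphabet sorts after all letters.
            key.append(len(ALPHABET) + ord(s[i]))
            i += 1
    return tuple(key)


words = ["đak", "dvor", "džep", "dan", "luk", "ljeto", "lopta", "noga", "njiva"]
print(sorted(words))                    # plain code point order
print(sorted(words, key=croatian_key))  # contraction-aware order
```

With the contraction-aware key, ‘džep’ lands between ‘dvor’ and ‘đak’, and ‘ljeto’ comes after ‘luk’, which is what the alphabet asks for. The greedy matching is also where the remaining problem comes from: a ‘d’ followed by ‘ž’ is always treated as the letter ‘dž’, even in words where they are two separate letters.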

Big thanks to MariaDB community!

5 Responses to “UTF8 collation support for Croatian, Bosnian and Serbian (latin) in MariaDB/MySQL”

  1. tabgilbert says:

    > …characters on the keyboard, we will never have correct sorting in any database.

    The sort characteristics of different languages had never occurred to me. One of the downsides of being an American. Very interesting. Thank you for that little bit of knowledge.

  2. sloser says:

    goodbye utf8_slovenian_ci

  3. seven says:

    ‘injekcija’ without NJ is not a Croatian word.

  4. Ante,

    I don’t mean to pick a nit, but you should s/MySQL/MariaDB/ in the title of this post.

    We have, indeed, pushed the fix into the MariaDB source, and future builds of MariaDB will offer correct sorting of Croatian and other Balkan language contractions.

    The Maria project always treats the upstream MySQL project as an equal, and we always provide our patches to them for inclusion in mainline MySQL. However, the final decision as to what gets merged lies with MySQL maintainers.

    If correct contraction sorting is of critical importance to your database use, I would encourage you to evaluate MariaDB during this beta cycle and switch to it when we have our final release before the end of the year.

    If using MySQL instead of MariaDB is more important, I encourage you to contact the MySQL devteam and push for inclusion of the patch we have written for MariaDB.

    How do you say “responsive project” in Croatian? ;)

  5. @seven ‘podžupan’ is, and it fails in sorting because the letter ‘dž’ and the sequence ‘d’ + ‘ž’ look the same.

    @Kurt big, big kudos to you and Monty! Sorry, I had no intention of being disrespectful. I titled it with MySQL because, at the moment, MySQL is the DB we have in current Ubuntu releases. Have no doubt, I will be one of the biggest supporters of having MariaDB instead of MySQL in the next Ubuntu releases. And we will be evaluating MariaDB as a MySQL replacement for our production services (inside Init and with our partners) from now on.