[CWB] UTF corpus and frequency list issue

Thu Sep 1 18:37:45 CEST 2016

Hi all,

 > This is a limitation in the collations provided by MySQL: ideally,
 > we’d want to be able to switch case-sensitivity and
 > diacritic-sensitivity independently, but the choice of collations you
 > get don’t afford that.

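My two cents: it's possible to simulate a case-insensitive,
diacritic-*sensitive* collation in MySQL by NFD-normalizing strings
before passing them on to the database and using the
`utf8_general_ci` collation. As far as I can tell, this works because
NFD splits each accented letter into a base letter plus a separate
combining mark, and `utf8_general_ci` does not ignore that mark when
comparing. See this gist for a demo (you need python3):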

<https://gist.github.com/dlukes/25467d658a5c5f53be0cfb55969e7dcd>
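
For a rough idea of the normalization step without opening the gist,
here is a minimal sketch. It is not the code from the gist: the names
`to_nfd` and `simulated_ci_equal` are made up for illustration, and the
comparison function is only a crude stand-in for a case-insensitive
collation, not MySQL's actual algorithm.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Decompose precomposed characters, e.g. U+00E9 -> 'e' + U+0301."""
    return unicodedata.normalize("NFD", text)


def simulated_ci_equal(a: str, b: str) -> bool:
    """Hypothetical helper: case-insensitive, code-point-wise comparison.

    This only mimics the effect described above; it is NOT how MySQL
    implements `utf8_general_ci`.
    """
    return to_nfd(a).casefold() == to_nfd(b).casefold()


if __name__ == "__main__":
    # After NFD, the accent is a separate code point (U+0301).
    print([hex(ord(c)) for c in to_nfd("\u00e9")])
    # Case differs, diacritics match: treated as equal.
    print(simulated_ci_equal("\u00c9lan", "\u00e9lan"))
    # Diacritics differ: treated as different.
    print(simulated_ci_equal("elan", "\u00e9lan"))
```

In a real setup you would apply something like `to_nfd` to both the
stored strings and the query strings, and leave the comparison itself
to the database.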

It might be a better default behavior, since in the context of
linguistics, one almost never (?) wants diacritic-insensitive
comparisons. Though of course I've no idea how much effort it would take
to incorporate this into the codebase, so it might not be worth the
hassle :)

Best,

David

---
David Lukeš
Institute of the Czech National Corpus
Faculty of Arts, Charles University
Prague, Czech Republic