[CWB] UTF corpus and frequency list issue
David Lukes
david.lukes at ff.cuni.cz
Thu Sep 1 18:37:45 CEST 2016
Hi all,
> This is a limitation in the collations provided by MySQL: ideally,
> we’d want to be able to switch case-sensitivity and
> diacritic-sensitivity independently, but the choice of collations you
> get doesn't afford that.
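My two cents: it's possible to simulate case-insensitive,
diacritic-*sensitive* collation in MySQL by NFD-normalizing strings
before passing them on to the database and using the
`utf8_general_ci` collation. After decomposition, each diacritic is a
separate combining code point, which `utf8_general_ci` does not ignore,
while base letters are still compared case-insensitively. See this gist
for a demo (you need Python 3):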
<https://gist.github.com/dlukes/25467d658a5c5f53be0cfb55969e7dcd>
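For readers who don't want to open the gist, here is a minimal sketch of
the normalization step in Python 3. It is not the gist's code: the helper
names `to_nfd` and `ci_ds_equal` are made up for illustration, and
`ci_ds_equal` only mimics in pure Python the result one would hope to get
from MySQL; the real comparison is done by the database collation.

```python
import unicodedata


def to_nfd(s: str) -> str:
    """Decompose precomposed characters into base letter + combining marks."""
    return unicodedata.normalize("NFD", s)


def ci_ds_equal(a: str, b: str) -> bool:
    """Hypothetical stand-in for the intended comparison:
    case-insensitive but diacritic-sensitive.

    This is NOT how MySQL compares strings; it only approximates the
    desired outcome by lowercasing the NFD forms.
    """
    return to_nfd(a).lower() == to_nfd(b).lower()


# Strings would be passed through to_nfd() before being sent to the database.
for word in ["\u00e9t\u00e9", "\u00c9t\u00e9", "ete"]:
    nfd = to_nfd(word)
    print(word, [f"U+{ord(c):04X}" for c in nfd])
```

With this, "É" and "é" compare as equal, while "e" and "é" stay
distinct, because the accent survives as its own code point (U+0301).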
This might be a better default behavior, since in the context of
linguistics, one almost never (?) wants diacritic-insensitive
comparisons. Though of course I've no idea how much effort it would take
to incorporate this into the codebase, so it might not be worth the
hassle :)
Best,
David
---
David Lukeš
Institute of the Czech National Corpus
Faculty of Arts, Charles University
Prague, Czech Republic