[CWB] [CQPWeb] diacritics in CQPweb
genereux
genereux at clul.ul.pt
Wed Mar 26 11:32:04 CET 2014
Hi,
Here's an issue concerning diacritics in CQPweb.
CQPweb stores frequency lists in MySQL. Since MySQL currently offers no
collation that is case-insensitive but diacritic-sensitive, a frequency
list merges tokens/characters as follows:
[e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...
What we want is:
[e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...
We can take care of the case-insensitivity programmatically outside
CQPweb/MySQL by lowercasing records before they enter the DB table.
The tables holding frequency lists are then declared with 'COLLATE
utf8_bin', which takes care of diacritic-sensitivity.
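To make the workaround concrete, here is a minimal sketch in Python. It is
only an illustration, not CQPweb code: the function names, the table name
freq_list_example and its columns are hypothetical, and the NFC
normalization step is an extra assumption on my part (so that precomposed
and decomposed forms of an accented letter end up byte-identical under
utf8_bin).

```python
import unicodedata
from collections import Counter


def fold_case_keep_diacritics(token):
    """Lowercase a token while leaving its diacritics untouched.

    NFC normalization is an added assumption: it makes 'e' + combining
    acute and precomposed 'é' identical, which a binary collation needs.
    """
    return unicodedata.normalize("NFC", token).lower()


def frequency_list(tokens):
    """Count tokens after case folding; diacritics stay distinct."""
    return Counter(fold_case_keep_diacritics(t) for t in tokens)


# Hypothetical table definition (names are illustrative only).
# utf8_bin compares byte by byte, so 'e' and 'é' remain separate rows.
CREATE_TABLE_SQL = """
CREATE TABLE freq_list_example (
    item VARCHAR(255) NOT NULL,
    freq INT UNSIGNED NOT NULL,
    PRIMARY KEY (item)
) CHARACTER SET utf8 COLLATE utf8_bin
"""

tokens = ["e", "E", "é", "É", "Ê", "ê", "o", "O", "ò", "ó", "Ó"]
freqs = frequency_list(tokens)
for item, freq in sorted(freqs.items()):
    print(item, freq)
```

With this input, the counts fall into the groups [e,E] [é,É] [Ê,ê] [o,O]
[ò] [ó,Ó], i.e. case is merged while each diacritic keeps its own entry.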
I am wondering whether people working with corpora for languages other
than English have dealt with this issue in some other (more elegant) way.
Thank you,
Michel Généreux