[CWB] [CQPWeb] diacritics in CQPweb

genereux genereux at clul.ul.pt
Wed Mar 26 11:32:04 CET 2014


Hi,

Here's an issue concerning diacritics in CQPweb.

CQPweb stores frequency lists in MySQL. Since MySQL currently offers no
collation that is case-insensitive but diacritic-sensitive, a frequency
list merges tokens/characters as follows:

[e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...

What we want is:

[e,E] [é,É] [ê,Ê] [o,O] [ò,Ò] [ó,Ó] ...
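
To make the two groupings concrete, here is a rough sketch in Python. It only emulates the behaviour outside the database: the function names are made up for illustration, and stripping combining marks is an approximation of what a case- and accent-insensitive collation does, not MySQL's actual comparison algorithm.

```python
import unicodedata
from collections import defaultdict


def accent_insensitive_key(token):
    """Approximate a case- and accent-insensitive comparison key."""
    # Decompose so that accents become separate combining marks,
    # drop those marks, then fold case.
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def case_only_key(token):
    """Fold case but keep diacritics (the grouping we actually want)."""
    return unicodedata.normalize("NFC", token).lower()


def group(tokens, key):
    """Collect tokens that share the same comparison key."""
    groups = defaultdict(list)
    for token in tokens:
        groups[key(token)].append(token)
    return dict(groups)


tokens = ["e", "é", "É", "Ê", "E", "ê"]
merged = group(tokens, accent_insensitive_key)  # one group under "e"
wanted = group(tokens, case_only_key)           # three groups: e, é, ê
```

With these six tokens, `merged` has a single entry while `wanted` keeps e, é and ê apart, each with its upper- and lowercase form.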

We can handle case-insensitivity programmatically outside CQPweb/MySQL
by lowercasing records before they enter the DB table. The tables
holding frequency lists are then declared with 'collate utf8_bin',
which takes care of diacritic-sensitivity.
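
A minimal sketch of that workaround, assuming the preprocessing is done in Python. The table and column names in the DDL string are hypothetical and are not CQPweb's real schema. The NFC normalisation step is my own assumption: utf8_bin compares raw bytes, so composed and decomposed forms of the same accented letter would otherwise count as different items.

```python
import unicodedata
from collections import Counter

# Hypothetical table definition; names are illustrative only.
# utf8_bin makes MySQL compare the stored bytes exactly.
HYPOTHETICAL_DDL = """
CREATE TABLE freq_example (
    item VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    freq INT UNSIGNED NOT NULL,
    PRIMARY KEY (item)
)
"""


def normalise(token):
    """Lowercase a token while keeping its diacritics."""
    # NFC first, so that each accented letter has one byte sequence.
    return unicodedata.normalize("NFC", token).lower()


def frequency_rows(tokens):
    """Return (item, freq) rows ready to insert into the utf8_bin table."""
    counts = Counter(normalise(token) for token in tokens)
    # Highest frequency first; ties broken by UTF-8 byte order,
    # which is the order a binary collation would use.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].encode("utf-8")))


rows = frequency_rows(["Été", "été", "ETE", "ete", "Ôter"])
```

Here "Été"/"été" and "ETE"/"ete" end up as two separate rows with a frequency of 2 each, and "Ôter" is stored as "ôter".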

I am wondering whether people working with corpora for languages other
than English have dealt with this issue in some other (more elegant) way.

Thank you,

Michel Généreux



