[CWB] [CQPWeb] diacritics in CQPweb

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Mar 26 18:16:17 CET 2014


Unfortunately, as you say, at the moment the choice is between CS/DS and CI/DI (case- and diacritic-sensitive, or insensitive to both), while for most linguistic purposes we want CI/DS. One of my planned developments is to introduce custom collations that can be loaded into MySQL and that will allow CI/DS, because I want it too! (I think I would have to define one from scratch, based on an automated mapping from the Unicode standard database, UNIDATA.TXT.)
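
To make the idea of an automated mapping concrete, here is a minimal sketch in Python. It is not a MySQL collation definition; it only derives the CI/DS equivalence classes (characters that differ in case alone) that such a collation would have to encode. It assumes Python's standard unicodedata module, which is built from the Unicode Character Database, as the data source rather than parsing the data file directly. The function name ci_ds_classes and the default code-point range are hypothetical choices for illustration.

```python
import unicodedata
from collections import defaultdict


def ci_ds_classes(first=0x0041, last=0x017F):
    """Group letters that differ only in case, keeping diacritics distinct.

    Returns a dict mapping a lowercase letter to the list of code points
    (as characters, in ascending order) that should compare equal under a
    case-insensitive, diacritic-sensitive (CI/DS) collation.
    The default range (Basic Latin through Latin Extended-A) is an
    arbitrary choice for this sketch.
    """
    classes = defaultdict(list)
    for cp in range(first, last + 1):
        ch = chr(cp)
        # Only cased letters are of interest here.
        if unicodedata.category(ch) not in ("Lu", "Ll"):
            continue
        lower = ch.lower()
        # Skip letters whose lowercase form is more than one character
        # (for example U+0130); they need special handling.
        if len(lower) != 1:
            continue
        classes[lower].append(ch)
    # Keep only classes that actually merge two or more characters.
    return {key: members for key, members in classes.items() if len(members) > 1}


classes = ci_ds_classes()
print(classes["e"])  # upper- and lowercase plain e
print(classes["é"])  # upper- and lowercase e with acute accent
```

Each resulting class corresponds to one group of characters that the collation should treat as equal, while "e" and "é" stay in separate classes.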

However, I need to find out first how this will affect performance. I have tried to find out whether using a custom, rather than built-in, collation affects MySQL performance (and also what effect the complexity of the custom collation has), but cannot find much online about it. So I will need to take time to do some empirical experimentation at some point.

So, if anyone has any information about, or experience with, MySQL custom collations, that would be very useful.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of genereux
Sent: 26 March 2014 10:32
To: Open source development of the Corpus WorkBench
Subject: [CWB] [CQPWeb] diacritics in CQPweb

Hi,

Here's an issue concerning diacritics in CQPweb.

CQPweb stores frequency lists in MySQL. Since no case-insensitive, diacritic-sensitive collations are currently available in MySQL, a frequency list merges tokens/characters as follows:

[e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...

What we want is:

[e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...

We can take care of the case-insensitivity programmatically outside CQPweb/MySQL by lowercasing records before they enter the DB table. Tables holding frequency lists are then declared as 'collate utf8_bin', which takes care of diacritic-sensitivity.
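
As an illustration of this workaround, here is a minimal Python sketch of the preprocessing step. It assumes that tokens are available as Python strings before they are written to the database; the function names are hypothetical and are not part of CQPweb. The NFC normalization is an extra assumption added here: under a binary collation such as utf8_bin, a precomposed "é" and an "e" followed by a combining accent would otherwise be stored as different entries.

```python
import unicodedata
from collections import Counter


def fold_case_keep_diacritics(token: str) -> str:
    """Lowercase a token without touching its diacritics.

    NFC normalization is applied first so that precomposed and
    combining-character spellings end up as the same byte sequence.
    """
    return unicodedata.normalize("NFC", token).lower()


def frequency_list(tokens):
    """Count tokens after case folding; diacritics remain distinct."""
    return Counter(fold_case_keep_diacritics(t) for t in tokens)


tokens = ["e", "E", "é", "É", "Ê", "ê", "o", "O", "ò", "Ó", "e\u0301"]
freq = frequency_list(tokens)
for form, count in sorted(freq.items()):
    print(form, count)
```

With the records folded this way, the binary collation only has to keep "e", "é" and "ê" apart, which it does by comparing bytes.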

I am wondering if people involved with corpora for languages other than English have dealt with this issue in some other (more elegant) way?

Thank you,

Michel Généreux




More information about the CWB mailing list