[CWB] [CQPWeb] diacritics in CQPweb

Wed Mar 26 18:59:44 CET 2014

Apologies if this is not relevant, but I thought that Unicode sorting 
recognized four "levels" in comparing two strings:
1. account is taken of differences in accents, case and specials
2. account is taken of differences in accents and case, but differences in 
specials are disregarded
3. account is taken of differences in accents, but differences in case and 
specials are disregarded
4. differences in accents, case and specials are disregarded

In these terms, what Michel wants is collation at level 3, but is getting 
collation at level 4.
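
To make the difference between levels 3 and 4 concrete, here is a minimal 
Python sketch that groups tokens the way each level would. The function 
names ci_ds_key and ci_di_key are made up for illustration and are not part 
of CWB, CQPweb or MySQL. The sketch only builds grouping keys, it does not 
define a sort order, it ignores specials, and it uses str.lower() rather than 
full Unicode case folding.

```python
import unicodedata
from collections import Counter


def ci_ds_key(token: str) -> str:
    """Case-insensitive, diacritic-sensitive key (level 3 in the list above)."""
    # NFC makes precomposed and decomposed spellings of the same accented
    # letter identical before the case difference is removed.
    return unicodedata.normalize("NFC", token).lower()


def ci_di_key(token: str) -> str:
    """Case-insensitive, diacritic-insensitive key (level 4 in the list above)."""
    # Decompose, then drop the combining marks that carry the accents.
    decomposed = unicodedata.normalize("NFD", token.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


# Hypothetical token list, modelled on the characters in Michel's message.
tokens = ["e", "E", "é", "É", "ê", "Ê", "o", "O", "ò", "Ò"]

level3 = Counter(ci_ds_key(t) for t in tokens)  # accents kept apart
level4 = Counter(ci_di_key(t) for t in tokens)  # accents merged

print(level3)
print(level4)
```

With this input, level 3 keeps e, é, ê, o and ò as five separate entries, 
while level 4 collapses them into just e and o.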

If the CWB developers have access to a "standard" collation procedure, it 
should take care of this requirement automatically, with the additional 
benefit that efficiency considerations can be left to the implementors of 
the standard procedure!

(Specials are non-alphabetic characters, including punctuation, which may be 
present in the strings.)

For more info, see http://en.wikipedia.org/wiki/ISO_14651 or 
http://www.unicode.org/reports/tr10/

I hope this is helpful,
Ciarán Ó Duibhín.

----- Original Message ----- 
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: "Open source development of the Corpus WorkBench" <cwb at sslmit.unibo.it>
Sent: Wednesday, March 26, 2014 5:16 PM
Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb

> Unfortunately, at the moment as you say there is a choice between CS/DS 
> and CI/DI, while for most linguistic purposes we want CI/DS. One of my 
> planned developments is to introduce custom collations that can be loaded 
> into MySQL that will allow CI/DS because I want it too! (I think I would 
> have to define one from scratch based on automated mapping from the 
> Unicode standard database UNIDATA.TXT).
>
> However, I need to find out first how this will affect performance. I have 
> tried to find out whether using a custom, rather than built-in, collation 
> affects MySQL performance (and also what effect the complexity of the 
> custom collation has), but cannot find much online about it. So I will 
> need to take time to do some empirical experimentation at some point.
>
> So ---- if anyone has any info or experience about MySQL custom collations 
> that would be very useful.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On 
> Behalf Of genereux
> Sent: 26 March 2014 10:32
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] [CQPWeb] diacritics in CQPweb
>
> Hi,
>
> Here's an issue concerning diacritics in CQPweb.
>
> CQPweb stores frequency lists in mysql. Since there are no 
> case-insensitive diacritic-sensitive collations currently available in 
> mysql, a frequency list merges tokens/characters as follows:
>
> [e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...
>
> What we want is:
>
> [e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...
>
> We can take care of the case-insensitivity programmatically outside 
> CQPweb/mysql by turning to lowercase records before they enter the DB 
> table. Tables holding frequency lists are then declared as 'collate 
> utf8_bin', which takes care of diacritic-sensitivity.
>
> I am wondering if people involved with corpora for languages other than 
> English have dealt with this issue in some other (more elegant) way?
>
> Thank you,
>
> Michel Généreux
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>