[CWB] [CQPWeb] diacritics in CQPweb

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Mar 27 01:11:50 CET 2014


Not helpful at all, alas, as you missed the critical context that we are talking about the Unicode collations available *in MySQL*, on which CQPweb depends. These collations include one (utf8_bin) that does "level 1", and one (utf8_general_ci) which does "level 4", but nothing that does "level 3" or "level 2". That was why I was saying I would have to add one myself.

See: http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html
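
As an illustration only, the gap described above can be sketched outside MySQL. The Python below is not MySQL code: `ci_di_key` is a rough approximation of how a case- and accent-insensitive collation such as utf8_general_ci groups tokens, and `ci_ds_key` is the missing case-insensitive, accent-sensitive behaviour. Both function names, and the `frequency_list` helper, are hypothetical and exist only for this example.

```python
import unicodedata
from collections import Counter


def ci_di_key(token):
    """Case- and accent-insensitive key (approximation, not MySQL's algorithm)."""
    decomposed = unicodedata.normalize("NFD", token.casefold())
    # Drop combining marks so that accented letters fall back to the base letter.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ci_ds_key(token):
    """Case-insensitive but accent-sensitive key: fold case, keep accents."""
    return unicodedata.normalize("NFC", token).casefold()


def frequency_list(tokens, key):
    """Count tokens after mapping each one through the given key function."""
    return Counter(key(t) for t in tokens)


tokens = ["e", "E", "é", "É", "ê", "Ê"]
merged = frequency_list(tokens, ci_di_key)     # every token lands in one bucket
separated = frequency_list(tokens, ci_ds_key)  # one bucket per accented letter
```

Storing the output of something like `ci_ds_key` in a column with a binary collation is, in effect, the lowercase-then-utf8_bin workaround discussed further down the thread.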

(Note that as rotten as MySQL is on this front, so far as I can tell other RDBMSs are even worse, as they seem to link collations to OS locales, which is the last thing you want in this context.)

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
Sent: 26 March 2014 18:00
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb

Apologies if this is not relevant, but I thought that Unicode sorting recognized four "levels" in comparing two strings:
1. account is taken of differences in accents, case and specials
2. account is taken of differences in accents and case, but differences in specials are disregarded
3. account is taken of differences in accents, but differences in case and specials are disregarded
4. differences in accents, case and specials are disregarded

In these terms, what Michel wants is collation at level 3, but is getting collation at level 4.

If the CWB developers have access to a "standard" collation procedure, it should take care of this requirement automatically, with the additional benefit that efficiency considerations can be left to the implementors of the standard procedure!

(Specials are non-alphabetic characters, including punctuation, which may be present in the strings.)
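
A minimal Python sketch of the four levels as numbered above, for illustration only. It is not an implementation of ISO 14651 or the Unicode Collation Algorithm: it only builds equality keys, says nothing about ordering, and treats "specials" as anything that is not a letter, combining mark or digit, which is an assumption made for this example. The names `level_key`, `strip_accents` and `strip_specials` are hypothetical.

```python
import unicodedata


def strip_accents(s):
    """Decompose the string and drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def strip_specials(s):
    """Keep letters (L*), marks (M*) and numbers (N*); drop everything else.

    Assumption for this sketch: "specials" means punctuation, symbols and spaces.
    """
    return "".join(ch for ch in s if unicodedata.category(ch)[0] in "LMN")


def level_key(s, level):
    """Return a key such that two strings are equal at `level` iff their keys match.

    Level numbering follows the list above: 1 distinguishes everything,
    4 disregards accents, case and specials.
    """
    s = unicodedata.normalize("NFC", s)
    if level >= 2:
        s = strip_specials(s)
    if level >= 3:
        s = s.casefold()
    if level >= 4:
        s = strip_accents(s)
    return unicodedata.normalize("NFC", s)
```

For example, `level_key("É", 3)` and `level_key("é", 3)` match while `level_key("e", 3)` does not; at level 4 all three match.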

For more info, see http://en.wikipedia.org/wiki/ISO_14651 or http://www.unicode.org/reports/tr10/

I hope this is helpful,
Ciarán Ó Duibhín.

----- Original Message -----
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: "Open source development of the Corpus WorkBench" <cwb at sslmit.unibo.it>
Sent: Wednesday, March 26, 2014 5:16 PM
Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb


> Unfortunately, at the moment as you say there is a choice between CS/DS 
> and CI/DI, while for most linguistic purposes we want CI/DS. One of my 
> planned developments is to introduce custom collations that can be loaded 
> into MySQL that will allow CI/DS because I want it too! (I think I would 
> have to define one from scratch based on automated mapping from the 
> Unicode standard database UNIDATA.TXT).
>
> However, I need to find out first how this will affect performance. I have 
> tried to find out whether using a custom, rather than built-in, collation 
> affects MySQL performance (and also what effect the complexity of the 
> custom collation has), but cannot find much online about it. So I will 
> need to take time to do some empirical experimentation at some point.
>
> So ---- if anyone has any info or experience about MySQL custom collations 
> that would be very useful.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On 
> Behalf Of genereux
> Sent: 26 March 2014 10:32
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] [CQPWeb] diacritics in CQPweb
>
> Hi,
>
> Here's an issue concerning diacritics in CQPweb.
>
> CQPweb stores frequency lists in mysql. Since there are no 
> case-insensitive diacritic-sensitive collations currently available in 
> mysql, a frequency list merges tokens/characters as follows:
>
> [e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...
>
> What we want is:
>
> [e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...
>
> We can take care of the case-insensitivity programmatically outside 
> CQPweb/mysql by converting records to lowercase before they enter the DB 
> table. Tables holding frequency lists are then declared as 'collate 
> utf8_bin', which takes care of diacritic-sensitivity.
>
> I am wondering if people involved with corpora for languages other than 
> English have dealt with this issue in some other (more elegant) way?
>
> Thank you,
>
> Michel Généreux
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> 


