[CWB] [CQPWeb] diacritics in CQPweb

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Thu Mar 27 15:32:27 CET 2014


Browsing around, I see that Firebird 2.5 has UTF8 collations called UNICODE, 
UNICODE_CI and UNICODE_CI_AI ( 
http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd25-collations.html#langrefupd25-collations-unicode )

For MariaDB, there are many collations containing "ci" in their names, but I 
can't see whether they are "ai" or "as" ( 
https://mariadb.com/kb/en/supported-character-sets-and-collations/ )
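
To make the naming concrete: in collation names of this kind, "ci" conventionally stands for case-insensitive, "ai" for accent-insensitive and "as" for accent-sensitive. Below is a minimal Python sketch of the three resulting equality classes. The function names are hypothetical, the keys only decide which strings compare as equal (they do not define a sort order), and nothing here reflects how Firebird, MariaDB or MySQL implement their collations internally.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_cs_as(text: str) -> str:
    """Case-sensitive, accent-sensitive: only normalise the encoding form."""
    return unicodedata.normalize("NFC", text)


def key_ci_as(text: str) -> str:
    """Case-insensitive, accent-sensitive: fold case, keep diacritics."""
    return unicodedata.normalize("NFC", text.casefold())


def key_ci_ai(text: str) -> str:
    """Case-insensitive, accent-insensitive: fold case and drop diacritics."""
    return strip_accents(text).casefold()


# Two strings are "equal under a collation" when their keys match.
same_ci_as = key_ci_as("É") == key_ci_as("é")   # differ only in case
diff_ci_as = key_ci_as("é") != key_ci_as("e")   # differ in accent
same_ci_ai = key_ci_ai("É") == key_ci_ai("e")   # accent and case ignored
```

A "ci" collation that behaves like key_ci_ai merges e and é; one that behaves like key_ci_as keeps them apart, which is the distinction the documentation leaves unclear.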

It looks like MySQL may have some catching up to do. I suppose there 
wouldn't be a repository of user-defined collations for MySQL?

Ciarán Ó Duibhín

----- Original Message ----- 
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: "Open source development of the Corpus WorkBench" <cwb at sslmit.unibo.it>
Sent: Thursday, March 27, 2014 12:11 AM
Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb


> Not helpful at all, alas, as you missed the critical context that we are 
> talking about the Unicode collations available *in MySQL*, on which CQPweb 
> depends. These collations include one (utf8_bin) that does "level 1", and 
> one (utf8_general_ci) which does "level 4", but nothing that does "level 
> 3" or "level 2". That was why I was saying I would have to add one myself.
>
> See: 
> http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html
>
> (Note that as rotten as MySQL is on this front, so far as I can tell other 
> RDBMSs are even worse, as they seem to link collations to OS locales, 
> which is the last thing you want in this context)
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On 
> Behalf Of Ciarán Ó Duibhín
> Sent: 26 March 2014 18:00
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>
> Apologies if this is not relevant, but I thought that unicode sorting 
> recognized four "levels" in comparing two strings:
> 1. account is taken of differences in accents, case and specials
> 2. account is taken of differences in accents and case, but differences in 
>    specials are disregarded
> 3. account is taken of differences in accents, but differences in case and 
>    specials are disregarded
> 4. differences in accents, case and specials are disregarded
>
> In these terms, what Michel wants is collation at level 3, but is getting 
> collation at level 4.
>
> If the CWB developers have access to a "standard" collation procedure, it 
> should take care of this requirement automatically, with the additional 
> benefit that efficiency considerations can be left to the implementors of 
> the standard procedure!
>
> (Specials are non-alphabetic characters, including punctuation, which may 
> be present in the strings.)
>
> For more info, see http://en.wikipedia.org/wiki/ISO_14651 or 
> http://www.unicode.org/reports/tr10/
>
> I hope this is helpful,
> Ciarán Ó Duibhín.
>
> ----- Original Message -----
> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
> To: "Open source development of the Corpus WorkBench" 
> <cwb at sslmit.unibo.it>
> Sent: Wednesday, March 26, 2014 5:16 PM
> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>
>
>> Unfortunately, at the moment as you say there is a choice between CS/DS
>> and CI/DI, while for most linguistic purposes we want CI/DS. One of my
>> planned developments is to introduce custom collations that can be loaded
>> into MySQL that will allow CI/DS because I want it too! (I think I would
>> have to define one from scratch based on automated mapping from the
>> Unicode standard database UNIDATA.TXT).
>>
>> However, I need to find out first how this will affect performance. I have
>> tried to find out whether using a custom, rather than built-in, collation
>> affects MySQL performance (and also what effect the complexity of the
>> custom collation has), but cannot find much online about it. So I will
>> need to take time to do some empirical experimentation at some point.
>>
>> So, if anyone has any info or experience with MySQL custom collations,
>> that would be very useful.
>>
>> best
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
>> Behalf Of genereux
>> Sent: 26 March 2014 10:32
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] [CQPWeb] diacritics in CQPweb
>>
>> Hi,
>>
>> Here's an issue concerning diacritics in CQPweb.
>>
>> CQPweb stores frequency lists in MySQL. Since there are no
>> case-insensitive diacritic-sensitive collations currently available in
>> MySQL, a frequency list merges tokens/characters as follows:
>>
>> [e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...
>>
>> What we want is:
>>
>> [e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...
>>
>> We can take care of the case-insensitivity programmatically outside
>> CQPweb/MySQL by converting records to lowercase before they enter the DB
>> table. Tables holding frequency lists are then declared as 'collate
>> utf8_bin', which takes care of diacritic-sensitivity.
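
The lowercasing step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names normalise_token and frequency_list are hypothetical, the token list is made up for the example, and the sketch does not touch MySQL or CQPweb at all. It just shows that lowercasing before counting merges case variants while keeping diacritic variants apart, which is what a utf8_bin table would then preserve.

```python
import unicodedata
from collections import Counter


def normalise_token(token: str) -> str:
    """Lowercase a token; NFC keeps accented letters as single code points."""
    return unicodedata.normalize("NFC", token).lower()


def frequency_list(tokens):
    """Count tokens after lowercasing, so E/e merge but e/é stay distinct."""
    return Counter(normalise_token(t) for t in tokens)


# Made-up sample input for illustration only.
tokens = ["e", "E", "é", "É", "Ê", "ê", "o", "O", "ò", "Ò"]
freq = frequency_list(tokens)

for form, count in sorted(freq.items()):
    print(form, count)
```

Each of the five resulting entries (e, é, ê, o, ò) ends up with a count of 2, matching the grouping asked for above.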
>>
>> I am wondering if people involved with corpora for languages other than
>> English have dealt with this issue in some other (more elegant) way?
>>
>> Thank you,
>>
>> Michel Généreux
>>
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


