[CWB] [CQPWeb] diacritics in CQPweb

genereux genereux at clul.ul.pt
Mon Mar 31 10:16:09 CEST 2014


I received feedback from the MariaDB technical team on this issue:

"A case-insensitive, but accent-sensitive collation that is available 
in MariaDB is latin1_general_ci,
http://collation-charts.org/mysql60/mysql604.latin1_general_ci.html. 
But for unicode characters MariaDB does not have general accent 
sensitive collations."

I've tested the latin1_general_ci collation on MariaDB (it should behave 
the same on MySQL) and it works as advertised.

For lack of anything better, this may be a convenient temporary solution 
for some corpora.
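
For corpora that cannot be stored as latin1, the grouping that a 
case-insensitive, accent-sensitive collation produces can be emulated 
before the data reaches the database. The following is only a minimal 
Python sketch of that idea, not CQPweb or MariaDB code; the function 
names ci_as_key and merge_frequencies and the sample counts are 
hypothetical.

```python
import unicodedata
from collections import Counter


def ci_as_key(token: str) -> str:
    """Fold case but keep diacritics, so 'É' matches 'é' but 'e' does not."""
    # NFC makes precomposed and decomposed spellings of a letter identical
    # before the case fold is applied.
    return unicodedata.normalize("NFC", token).lower()


def merge_frequencies(counts: dict) -> Counter:
    """Merge frequency counts for tokens that differ only in case."""
    merged = Counter()
    for token, freq in counts.items():
        merged[ci_as_key(token)] += freq
    return merged


# Hypothetical sample counts, for illustration only.
raw = {"e": 5, "E": 2, "é": 3, "É": 1, "Ê": 4, "ê": 1}
merged = merge_frequencies(raw)
```

Here the six entries collapse to three keys (e, é, ê), which is the 
grouping asked for at the start of this thread. The keys can then be 
stored in a table that uses a binary collation.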

Best,

Michel



On Thu Mar 27 2014 17:03, genereux wrote:
> The obvious explanation I can find why there are many collations
> (german, hungarian, spanish ...) is that accent and case sensitivities
> can be language specific.
> 
> Yet, it seems to me that a collation offering ci and as across all
> accented characters should be suitable for some if not many languages,
> hence my surprise at not finding one ...
> 
> Best regards,
> 
> Michel
> 
> On Thu Mar 27 2014 16:23, Hardie, Andrew wrote:
>> The MariaDB collations are identical to the MySQL ones, as there have
>> been no relevant changes since the fork.
>> The new Firebird collations are a lot better, which I hadn't known;
>> thanks for pointing it out. However, it is somewhat academic, since I
>> am not about to port the whole thing to a Firebird backend!
>> best
>> Andrew.
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
>> Sent: 27 March 2014 14:32
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>> Browsing around, I see that Firebird 2.5 has UTF8 collations called
>> UNICODE, UNICODE_CI and UNICODE_CI_AI (
>> http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd25-collations.html#langrefupd25-collations-unicode
>> )
>> For MariaDB, there are many collations containing "ci" in their
>> names, but I can't see whether they are "ai" or "as" (
>> https://mariadb.com/kb/en/supported-character-sets-and-collations/ )
>> It looks like MySQL may have some catching up to do. I suppose there
>> wouldn't be a repository of user-defined collations for MySQL?
>> Ciarán Ó Duibhín
>> ----- Original Message -----
>> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
>> To: "Open source development of the Corpus WorkBench" 
>> <cwb at sslmit.unibo.it>
>> Sent: Thursday, March 27, 2014 12:11 AM
>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>> 
>> 
>>> Not helpful at all, alas, as you missed the critical context that we are
>>> talking about the Unicode collations available *in MySQL*, on which CQPweb
>>> depends. These collations include one (utf8_bin) that does "level 1", and
>>> one (utf8_general_ci) which does "level 4", but nothing that does "level
>>> 3" or "level 2". That was why I was saying I would have to add one myself.
>>> See:
>>> http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html
>>> (Note that as rotten as MySQL is on this front, so far as I can tell other
>>> RDBMSs are even worse, as they seem to link collations to OS locales,
>>> which is the last thing you want in this context)
>>> best
>>> Andrew.
>>> -----Original Message-----
>>> From: cwb-bounces at sslmit.unibo.it 
>>> [mailto:cwb-bounces at sslmit.unibo.it] On
>>> Behalf Of Ciarán Ó Duibhín
>>> Sent: 26 March 2014 18:00
>>> To: Open source development of the Corpus WorkBench
>>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>>> Apologies if this is not relevant, but I thought that unicode sorting
>>> recognized four "levels" in comparing two strings:
>>> 1. account is taken of differences in accents, case and specials
>>> 2. account is taken of differences in accents and case, but differences
>>>    in specials are disregarded
>>> 3. account is taken of differences in accents, but differences in case
>>>    and specials are disregarded
>>> 4. differences in accents, case and specials are disregarded
>>> In these terms, what Michel wants is collation at level 3, but is getting
>>> collation at level 4.
>>> If the CWB developers have access to a "standard" collation procedure, it
>>> should take care of this requirement automatically, with the additional
>>> benefit that efficiency considerations can be left to the implementors of
>>> the standard procedure!
>>> (Specials are non-alphabetic characters, including punctuation, which may
>>> be present in the strings.)
>>> For more info, see http://en.wikipedia.org/wiki/ISO_14651 or
>>> http://www.unicode.org/reports/tr10/
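
The four levels as numbered in this message can be approximated with 
simple comparison keys. The following Python sketch is only an 
illustration under stated assumptions: it is not the Unicode Collation 
Algorithm or ISO 14651, the function names are hypothetical, "specials" 
are taken to be anything that is not a letter or digit, and accents are 
taken to be combining marks in the decomposed form.

```python
import unicodedata


def strip_specials(s: str) -> str:
    """Drop everything that is not a letter or digit (an approximation)."""
    return "".join(ch for ch in s if ch.isalnum())


def strip_accents(s: str) -> str:
    """Drop combining marks after decomposition.

    Simplification: letters with no decomposition (e.g. 'ø') are unchanged.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def level_key(s: str, level: int) -> str:
    """Comparison key for levels 1 to 4 as numbered in the message above."""
    key = unicodedata.normalize("NFC", s)
    if level >= 2:  # specials are disregarded from level 2 onwards
        key = strip_specials(key)
    if level >= 3:  # case is disregarded from level 3 onwards
        key = key.lower()
    if level >= 4:  # accents are disregarded only at level 4
        key = strip_accents(key)
    return key
```

Two strings count as equal at a given level when their keys match. At 
level 3, "Été" and "été" match while "ete" and "été" do not, which is 
the behaviour asked for in this thread.
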
>>> I hope this is helpful,
>>> Ciarán Ó Duibhín.
>>> ----- Original Message -----
>>> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
>>> To: "Open source development of the Corpus WorkBench"
>>> <cwb at sslmit.unibo.it>
>>> Sent: Wednesday, March 26, 2014 5:16 PM
>>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>>> 
>>> 
>>>> Unfortunately, at the moment as you say there is a choice between CS/DS
>>>> and CI/DI, while for most linguistic purposes we want CI/DS. One of my
>>>> planned developments is to introduce custom collations that can be loaded
>>>> into MySQL that will allow CI/DS because I want it too! (I think I would
>>>> have to define one from scratch based on automated mapping from the
>>>> Unicode standard database UNIDATA.TXT).
>>>> However, I need to find out first how this will affect performance. I have
>>>> tried to find out whether using a custom, rather than built-in, collation
>>>> affects MySQL performance (and also what effect the complexity of the
>>>> custom collation has), but cannot find much online about it. So I will
>>>> need to take time to do some empirical experimentation at some point.
>>>> So ---- if anyone has any info or experience about MySQL custom collations
>>>> that would be very useful.
>>>> best
>>>> Andrew.
>>>> -----Original Message-----
>>>> From: cwb-bounces at sslmit.unibo.it 
>>>> [mailto:cwb-bounces at sslmit.unibo.it] On
>>>> Behalf Of genereux
>>>> Sent: 26 March 2014 10:32
>>>> To: Open source development of the Corpus WorkBench
>>>> Subject: [CWB] [CQPWeb] diacritics in CQPweb
>>>> Hi,
>>>> Here's an issue concerning diacritics in CQPweb.
>>>> CQPweb stores frequency lists in MySQL. Since there are no
>>>> case-insensitive diacritic-sensitive collations currently available in
>>>> MySQL, a frequency list merges tokens/characters as follows:
>>>> [e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...
>>>> What we want is:
>>>> [e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...
>>>> We can take care of the case-insensitivity programmatically outside
>>>> CQPweb/MySQL by lowercasing records before they enter the DB table.
>>>> Tables holding frequency lists are then declared as 'collate utf8_bin',
>>>> which takes care of diacritic-sensitivity.
>>>> I am wondering if people involved with corpora for languages other than
>>>> English have dealt with this issue in some other (more elegant) way?
>>>> Thank you,
>>>> Michel Généreux
>>>> 
>>>> _______________________________________________
>>>> CWB mailing list
>>>> CWB at sslmit.unibo.it
>>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

