[CWB] [CQPWeb] diacritics in CQPweb

genereux genereux at clul.ul.pt
Thu Mar 27 17:03:18 CET 2014


The obvious explanation I can find for why there are so many collations
(German, Hungarian, Spanish ...) is that accent and case sensitivity
can be language-specific.

Yet it seems to me that a collation that is case-insensitive (ci) and
accent-sensitive (as) across all accented characters should suit some,
if not many, languages, hence my surprise at not finding one ...

Best regards,

Michel

On Thu Mar 27 2014 16:23, Hardie, Andrew wrote:
> The MariaDB collations are identical to the MySQL ones, as there have
> been no relevant changes since the fork.
> 
> The new Firebird collations are a lot better, which I hadn't known;
> thanks for pointing it out. However, it is somewhat academic, since I
> am not about to port the whole thing to a Firebird backend!
> 
> best
> 
> Andrew.
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
> Sent: 27 March 2014 14:32
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
> 
> Browsing around, I see that Firebird 2.5 has UTF8 collations called
> UNICODE, UNICODE_CI and UNICODE_CI_AI (
> http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd25-collations.html#langrefupd25-collations-unicode
> )
> 
> For MariaDB, there are many collations containing "ci" in their
> names, but I can't see whether they are "ai" or "as" (
> https://mariadb.com/kb/en/supported-character-sets-and-collations/ )
> 
> It looks like MySQL may have some catching up to do. I suppose there
> wouldn't be a repository of user-defined collations for MySQL?
> 
> Ciarán Ó Duibhín
> 
> ----- Original Message -----
> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
> To: "Open source development of the Corpus WorkBench" 
> <cwb at sslmit.unibo.it>
> Sent: Thursday, March 27, 2014 12:11 AM
> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
> 
> 
>> Not helpful at all, alas, as you missed the critical context that we are
>> talking about the Unicode collations available *in MySQL*, on which CQPweb
>> depends. These collations include one (utf8_bin) that does "level 1", and
>> one (utf8_general_ci) which does "level 4", but nothing that does "level
>> 3" or "level 2". That was why I was saying I would have to add one myself.
>> 
>> See:
>> http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html
>> 
>> (Note that as rotten as MySQL is on this front, so far as I can tell other
>> RDBMSs are even worse, as they seem to link collations to OS locales,
>> which is the last thing you want in this context)
>> 
>> best
>> 
>> Andrew.
>> 
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it 
>> [mailto:cwb-bounces at sslmit.unibo.it] On
>> Behalf Of Ciarán Ó Duibhín
>> Sent: 26 March 2014 18:00
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>> 
>> Apologies if this is not relevant, but I thought that unicode sorting
>> recognized four "levels" in comparing two strings:
>> 1. account is taken of differences in accents, case and specials
>> 2. account is taken of differences in accents and case, but differences in
>>    specials are disregarded
>> 3. account is taken of differences in accents, but differences in case and
>>    specials are disregarded
>> 4. differences in accents, case and specials are disregarded
>> 
>> In these terms, what Michel wants is collation at level 3, but is getting
>> collation at level 4.
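
The four levels listed above can be approximated with simple comparison keys. This is only a rough sketch under stated assumptions, not the Unicode Collation Algorithm: the function names are hypothetical, "specials" are taken to be anything that is neither alphanumeric nor a combining mark, and the level numbers follow the message above rather than the numbering used in UTS #10 itself.

```python
import unicodedata


def strip_specials(s: str) -> str:
    """Keep letters, digits and combining marks; drop everything else."""
    return "".join(c for c in s if c.isalnum() or unicodedata.combining(c))


def strip_accents(s: str) -> str:
    """Decompose, drop combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def key(s: str, level: int) -> str:
    """Comparison key for levels 1-4 as numbered in the message above."""
    s = unicodedata.normalize("NFC", s)
    if level >= 2:          # disregard specials
        s = strip_specials(s)
    if level >= 3:          # also disregard case
        s = s.casefold()
    if level >= 4:          # also disregard accents
        s = strip_accents(s)
    return s


pairs = [("Côté", "côté"), ("côté", "cote"), ("co-op", "coop")]
for a, b in pairs:
    equal_at = [lvl for lvl in (1, 2, 3, 4) if key(a, lvl) == key(b, lvl)]
    print(a, b, "equal at levels", equal_at)
```

Two strings that produce the same key at a given level would be merged by a collation working at that level, which is why level 3 keeps "côté" and "cote" apart while level 4 does not.
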
>> 
>> If the CWB developers have access to a "standard" collation procedure, it
>> should take care of this requirement automatically, with the additional
>> benefit that efficiency considerations can be left to the implementors of
>> the standard procedure!
>> 
>> (Specials are non-alphabetic characters, including punctuation, which may
>> be present in the strings.)
>> 
>> For more info, see http://en.wikipedia.org/wiki/ISO_14651 or
>> http://www.unicode.org/reports/tr10/
>> 
>> I hope this is helpful,
>> Ciarán Ó Duibhín.
>> 
>> ----- Original Message -----
>> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
>> To: "Open source development of the Corpus WorkBench"
>> <cwb at sslmit.unibo.it>
>> Sent: Wednesday, March 26, 2014 5:16 PM
>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb
>> 
>> 
>>> Unfortunately, at the moment as you say there is a choice between CS/DS
>>> and CI/DI, while for most linguistic purposes we want CI/DS. One of my
>>> planned developments is to introduce custom collations that can be loaded
>>> into MySQL that will allow CI/DS because I want it too! (I think I would
>>> have to define one from scratch based on automated mapping from the
>>> Unicode standard database UNIDATA.TXT).
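
As an illustration of what such an automated mapping might start from, here is a small sketch that uses Python's standard `unicodedata` module (which wraps the Unicode Character Database) to group letters into case-insensitive, diacritic-sensitive classes. The function name `ci_ds_classes` and the choice of code-point range are assumptions made for the example; it does not produce a MySQL collation definition, only the equivalence classes such a collation would need to encode.

```python
import unicodedata
from collections import defaultdict


def ci_ds_classes(first: int, last: int) -> dict:
    """Group the letters in [first, last] by their lowercase form, so that
    case variants share a class but accented letters stay separate."""
    classes = defaultdict(list)
    for cp in range(first, last + 1):
        ch = chr(cp)
        # General category L* means "letter"; this skips symbols like × and ÷.
        if unicodedata.category(ch).startswith("L"):
            classes[ch.lower()].append(ch)
    return dict(classes)


# Basic Latin letters plus the Latin-1 Supplement (U+0041..U+00FF).
classes = ci_ds_classes(0x41, 0xFF)
for rep in "eéêoòó":
    print(rep, classes[rep])
```

Each class corresponds to one row of the desired frequency list: "E" and "e" fall together, while "é" and "ê" each get a class of their own.
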
>>> 
>>> However, I need to find out first how this will affect performance. I have
>>> tried to find out whether using a custom, rather than built-in, collation
>>> affects MySQL performance (and also what effect the complexity of the
>>> custom collation has), but cannot find much online about it. So I will
>>> need to take time to do some empirical experimentation at some point.
>>> 
>>> So ---- if anyone has any info or experience about MySQL custom collations
>>> that would be very useful.
>>> 
>>> best
>>> 
>>> Andrew.
>>> 
>>> -----Original Message-----
>>> From: cwb-bounces at sslmit.unibo.it 
>>> [mailto:cwb-bounces at sslmit.unibo.it] On
>>> Behalf Of genereux
>>> Sent: 26 March 2014 10:32
>>> To: Open source development of the Corpus WorkBench
>>> Subject: [CWB] [CQPWeb] diacritics in CQPweb
>>> 
>>> Hi,
>>> 
>>> Here's an issue concerning diacritics in CQPweb.
>>> 
>>> CQPweb stores frequency lists in mysql. Since there are no
>>> case-insensitive diacritic-sensitive collations currently available in
>>> mysql, a frequency list merges tokens/characters as follows:
>>> 
>>> [e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...
>>> 
>>> What we want is:
>>> 
>>> [e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...
>>> 
>>> We can take care of the case-insensitivity programmatically outside
>>> CQPweb/mysql by lowercasing records before they enter the DB table.
>>> Tables holding frequency lists are then declared as 'collate
>>> utf8_bin', which takes care of diacritic-sensitivity.
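
A minimal sketch of the pre-processing step described above, assuming tokens arrive as Python strings. The helper names `normalise_token` and `frequency_list` are hypothetical and not part of CQPweb, and the NFC normalisation is an added assumption so that precomposed and decomposed spellings of the same letter do not end up as separate rows under a binary collation.

```python
import unicodedata
from collections import Counter


def normalise_token(token: str) -> str:
    """Normalise a token to NFC and lowercase it (hypothetical helper)."""
    # NFC first, so "e" + COMBINING ACUTE ACCENT becomes the single code point "é".
    return unicodedata.normalize("NFC", token).lower()


def frequency_list(tokens):
    """Count tokens case-insensitively while keeping diacritics distinct."""
    return Counter(normalise_token(t) for t in tokens)


tokens = ["e", "E", "é", "É", "e\u0301", "Ê", "ê", "o", "Ó"]
freq = frequency_list(tokens)
for form, count in sorted(freq.items()):
    print(form, count)
```

The counts produced this way could then be inserted into a table declared with a binary collation, where "e" and "é" remain separate entries because their code points differ.
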
>>> 
>>> I am wondering if people involved with corpora for languages other than
>>> English have dealt with this issue in some other (more elegant) way?
>>> 
>>> Thank you,
>>> 
>>> Michel Généreux
>>> 
>>> 
>>> _______________________________________________
>>> CWB mailing list
>>> CWB at sslmit.unibo.it
>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb