[CWB] UTF corpus and frequency list issue

David Lukes david.lukes at ff.cuni.cz
Mon Sep 5 11:19:24 CEST 2016


 > a bug (or perhaps, rather, a limitation) in the MySQL collations.
 > Under the Unicode collation rules I don't think it's allowed for the
 > NFD and NFC forms of "the same thing" to exhibit different collation
 > behaviour.

I think it's a conscious deviation from the spec, made as a performance
optimization, which probably has no real benefit anymore but is kept
around for compatibility. Nowadays there seems to be no good reason to
use `utf8_general_ci` over `utf8_unicode_ci` (or indeed
`utf8mb4_unicode_ci`) unless you want to exploit this failure to treat
combining characters correctly. There are applications where this can be
handy (especially if the relevant columns serve a single purpose and
access to them is tightly controlled)...
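
To make the combining-character point concrete, here is a minimal Python
sketch. The `general_ci_like_key` function is a made-up stand-in for a
collation that folds case and accents one code point at a time while
treating combining marks as ordinary characters. It is an assumption for
illustration only, not MySQL's actual weight tables.

```python
import unicodedata


def general_ci_like_key(s: str) -> str:
    """Hypothetical per-code-point key: folds case and precomposed accents.

    Combining marks are kept as characters in their own right, which is
    the behaviour the NFD trick relies on. Not MySQL's real implementation.
    """
    key = []
    for ch in s:
        if unicodedata.combining(ch):
            # A standalone combining mark is not ignored; it keeps its own weight.
            key.append(ch)
        else:
            # A precomposed letter is reduced to its base letter, uppercased.
            base = unicodedata.normalize("NFD", ch)[0]
            key.append(base.upper())
    return "".join(key)


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


words = ["čeří", "ceři", "ceri"]

# Precomposed (NFC) input: accents are folded away, all three words collide.
nfc_keys = {general_ci_like_key(nfc(w)) for w in words}

# Decomposed (NFD) input: the combining marks survive, the words stay apart.
nfd_keys = {general_ci_like_key(nfd(w)) for w in words}

print(len(nfc_keys), len(nfd_keys))

# Case is still folded in the NFD variant.
same_ignoring_case = general_ci_like_key(nfd("ČEří")) == general_ci_like_key(nfd("čeří"))
print(same_ignoring_case)
```

With NFC input the key is both case- and diacritic-insensitive; with NFD
input it stays case-insensitive but becomes diacritic-sensitive.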

 > we also use the collation for the sake of combining together the
 > binary-collated elements from the CWB index when creating the
 > frequency tables, for doing table joins of various kinds when
 > calculating collocations, etc. etc.

... but CQPweb doesn't seem to be one of them :) This sounds like all
kinds of subtle and not-so-subtle breakage could ensue.
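
The mismatch is easy to see outside the database. In the small Python
sketch below, `cwb_index` and `db_rows` are hypothetical stand-ins (not
actual CQPweb structures), and the sketch assumes the CWB side holds NFC
strings while the database side holds their NFD forms. An exact-match
join between the two then only finds the forms without diacritics:

```python
import unicodedata

# Hypothetical stand-ins: the same word forms as stored on each side.
cwb_index = [unicodedata.normalize("NFC", w) for w in ["čeří", "ceři", "ceri"]]
db_rows = [unicodedata.normalize("NFD", w) for w in cwb_index]

for indexed, stored in zip(cwb_index, db_rows):
    # Compare the UTF-8 byte lengths and test for exact (binary) equality.
    print(
        repr(indexed),
        len(indexed.encode("utf-8")),
        len(stored.encode("utf-8")),
        indexed == stored,
    )

# Emulate a binary join: only strings identical on both sides match.
matched = set(cwb_index) & set(db_rows)
print(matched)  # only the pure-ASCII form 'ceri' survives
```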

 > One primary use-case for diacritic-insensitive comparison is actually
 > for English text. As it is rather common for diacritics to be left off
 > words like "café" and "naïve"

That's a good point, noted! As a person working primarily with a
morphology-rich language (Czech), I tend to think of word form
normalization as having to be dealt with on the annotation level anyway
using full-on lemmatization, so this didn't really cross my mind :)

Best,

David

On 09/04/2016 09:10 PM, Hardie, Andrew wrote:
> Interesting. I threw together a PHP equivalent of that Python demo, so I could run it within the CQPweb environment (code below in case anyone cares!)
>
> I have to suspect this is a bug (or perhaps, rather, a limitation) in the MySQL collations. Under the Unicode collation rules I don't think it's allowed for the NFD and NFC forms of "the same thing" to exhibit different collation behaviour. I could be wrong about that though. I freely admit that the fine details of the UCA are beyond me.
>
> Anyway, correct behaviour or not, the method demonstrated does clearly work. However, this would only be a practical solution to the general issue if the only thing the collation was used for was to look up words in the table in this way. Alas it's not - we also use the collation for the sake of combining together the binary-collated elements from the CWB index when creating the frequency tables, for doing table joins of various kinds when calculating collocations, etc. etc. So it's not really workable. Moreover, to make it so that the string forms in the MySQL DB don't binary-match the string forms in the CWB index would be to create an entire world of pain in terms of the required conversions and concomitant edge cases and possible bugs...
>
> One primary use-case for diacritic-insensitive comparison is actually for English text. As it is rather common for diacritics to be left off words like "café" and "naïve" - and also because there is a certain level of disagreement about where the diacritics actually are, e.g. the New Yorker insists on spelling words like "coöperation" with a diaeresis if I recall correctly - it's often desirable to catch them all at once when searching. "naïve"%d will also catch "naive".
>
> best
>
> Andrew.
>
> <?php
>
> include ("lib/config.inc.php");
>
> // Connect using the CQPweb credentials loaded from the config file.
> $mysql_link = @mysql_connect($mysql_server, $mysql_webuser, $mysql_webpass, false, 128);
> mysql_select_db($mysql_schema);
>
> // Scratch table whose text column uses the utf8_general_ci collation.
> mysql_query("create table zzz_experidict (id integer not null auto_increment primary key, val varchar(256) not null default '') character set utf8 collate utf8_general_ci");
>
> // Insert three test words in NFD (decomposed) form.
> foreach ( explode(' ', 'čeří ceři ceri') as $w )
> {
>          $n_w = Normalizer::normalize($w, Normalizer::FORM_D);
>          mysql_query("insert into zzz_experidict (val) values ('$n_w')");
> }
>
> // Look up each query string, also NFD-normalized, with a plain equality test.
> foreach ( explode(' ', 'čeří ČEří ceři CeřI ceri CeRi') as $s)
> {
>          echo "Searched for $s. Found: \n";
>          $n_s = Normalizer::normalize($s, Normalizer::FORM_D);
>          $result = mysql_query("select val from zzz_experidict where val = '$n_s'");
>          while (false !== ($o = mysql_fetch_object($result)))
>                  echo "\t", $o->val, "\n";
> }
>
> // Clean up the scratch table.
> mysql_query("drop table if exists zzz_experidict");
>
> /*
> Searched for čeří. Found:
>          čeří
> Searched for ČEří. Found:
>          čeří
> Searched for ceři. Found:
>          ceři
> Searched for CeřI. Found:
>          ceři
> Searched for ceri. Found:
>          ceri
> Searched for CeRi. Found:
>          ceri
> */
>
>
> -----Original Message-----
> From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of David Lukes
> Sent: 01 September 2016 17:38
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] UTF corpus and frequency list issue
>
> Hi all,
>
>   > This is a limitation in the collations provided by MySQL: ideally,
>   > we’d want to be able to switch case-sensitivity and
>   > diacritic-sensitivity independently, but the choice of collations you
>   > get doesn’t afford that.
>
> My two cents: it's possible to simulate case-insensitive,
> diacritic-*sensitive* collation in MySQL by NFD normalizing strings
> before passing them on to the database, in conjunction with
> `utf8_general_ci` collation. See this gist for a demo (you need
> python3):
>
> <https://gist.github.com/dlukes/25467d658a5c5f53be0cfb55969e7dcd>
>
> It might be a better default behavior, since in the context of
> linguistics, one almost never (?) wants diacritic-insensitive
> comparisons. Though of course I've no idea how much effort it would take
> to incorporate this into the codebase, so it might not be worth the
> hassle :)
>
> Best,
>
> David
>
> ---
> David Lukeš
> Institute of the Czech National Corpus
> Faculty of Arts, Charles University
> Prague, Czech Republic
>
> _______________________________________________
> CWB mailing list
> CWB at liste.sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb


