[CWB] UTF corpus and frequency list issue

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Sep 4 21:10:35 CEST 2016


Interesting. I threw together a PHP equivalent of that Python demo so that I could run it within the CQPweb environment (code below, in case anyone cares!).

I have to suspect this is a bug (or perhaps, rather, a limitation) in the MySQL collations. Under the Unicode collation rules, I don't think the NFD and NFC forms of "the same thing" are allowed to exhibit different collation behaviour. I could be wrong about that, though; I freely admit that the fine details of the UCA (the Unicode Collation Algorithm) are beyond me.
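To make the NFC/NFD distinction concrete, here is a minimal Python sketch (standard library only; the helper name `canonically_equivalent` is made up for illustration). It shows that the two forms of "čeří" are different code point sequences that nevertheless count as "the same thing" once both are brought to a common normalization form. It says nothing about what MySQL does internally.

```python
import unicodedata

word = "čeří"
nfc = unicodedata.normalize("NFC", word)  # precomposed letters
nfd = unicodedata.normalize("NFD", word)  # base letters + combining marks

# Show the code points of each form.
print(len(nfc), [f"U+{ord(c):04X}" for c in nfc])
print(len(nfd), [f"U+{ord(c):04X}" for c in nfd])


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(nfc == nfd)                        # plain code point comparison
print(canonically_equivalent(nfc, nfd))  # comparison after normalization
```

A collation that treats canonically equivalent strings alike would be expected to behave like the second comparison rather than the first.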

Anyway, correct behaviour or not, the method demonstrated clearly does work. However, it would only be a practical solution to the general issue if looking up words in the table in this way were the only thing the collation was used for. Alas, it's not: we also use the collation to combine the binary-collated elements from the CWB index when creating the frequency tables, to do table joins of various kinds when calculating collocations, and so on. So it's not really workable. Moreover, making the string forms in the MySQL DB no longer binary-match the string forms in the CWB index would create an entire world of pain in terms of the required conversions and the concomitant edge cases and possible bugs...
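The binary-mismatch concern can be illustrated with a small Python sketch. Everything in it is an assumption for the sake of the example: `index_forms` stands in for word forms coming out of a CWB index (taken here to be NFC), and `db_forms` stands in for a database column holding the same words NFD-normalized. It does not touch CWB or MySQL.

```python
import unicodedata

# Hypothetical word forms as they might come out of a CWB index (assumed NFC).
index_forms = [unicodedata.normalize("NFC", w) for w in ["čeří", "ceři", "ceri"]]

# Hypothetical database column storing the same words in NFD, as UTF-8 bytes.
db_forms = {unicodedata.normalize("NFD", w).encode("utf-8") for w in index_forms}

# A byte-for-byte join only succeeds for words that contain no diacritics.
binary_matches = [w for w in index_forms if w.encode("utf-8") in db_forms]

# Every lookup has to be converted to NFD first for the join to succeed.
converted_matches = [
    w for w in index_forms
    if unicodedata.normalize("NFD", w).encode("utf-8") in db_forms
]

print(binary_matches)
print(converted_matches)
```

Each place where the two stores meet would need a conversion step like the second list comprehension, which is where the edge cases would creep in.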

One primary use-case for diacritic-insensitive comparison is actually English text. It is rather common for diacritics to be left off words like "café" and "naïve", and there is also a certain level of disagreement about where the diacritics actually belong: The New Yorker, for example, insists on spelling words like "coöperation" with a diaeresis, if I recall correctly. So it's often desirable to catch all the variants at once when searching: the query "naïve"%d will also catch "naive".
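As a rough illustration of what diacritic-insensitive matching amounts to, the Python sketch below folds diacritics by decomposing to NFD and dropping combining marks. The function names are invented for this example, and this is not how CQP implements the %d flag; it only mimics the effect on words like those above.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def matches_ignoring_diacritics(pattern: str, token: str) -> bool:
    """Compare two word forms with diacritics folded away; case is still significant."""
    return fold_diacritics(pattern) == fold_diacritics(token)


for pattern, token in [("naïve", "naive"), ("café", "cafe"), ("coöperation", "cooperation")]:
    print(pattern, token, matches_ignoring_diacritics(pattern, token))
```

Note that the comparison stays case-sensitive, so "Naïve" and "naive" would not match here unless case were folded as well.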

best

Andrew.

<?php

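// Needs the legacy mysql_* functions (ext/mysql, removed in PHP 7.0)
// and the intl extension, which provides the Normalizer class.
include ("lib/config.inc.php");

// Connect using the credentials defined in the CQPweb config file.
$mysql_link = @mysql_connect($mysql_server, $mysql_webuser, $mysql_webpass, false, 128);
mysql_select_db($mysql_schema);

// Scratch table using the case-insensitive utf8_general_ci collation.
mysql_query("create table zzz_experidict (id integer not null auto_increment primary key, val varchar(256) not null default '') character set utf8 collate utf8_general_ci");

// Store three test words, each NFD-normalized before insertion.
foreach ( explode(' ', 'čeří ceři ceri') as $w )
{
        $n_w = Normalizer::normalize($w, Normalizer::FORM_D);
        mysql_query("insert into zzz_experidict (val) values ('$n_w')");
}

// Look up each search term (also NFD-normalized) and print whatever matches.
foreach ( explode(' ', 'čeří ČEří ceři CeřI ceri CeRi') as $s)
{
        echo "Searched for $s. Found: \n";
        $n_s = Normalizer::normalize($s, Normalizer::FORM_D);
        $result = mysql_query("select val from zzz_experidict where val = '$n_s'");
        while (false !== ($o = mysql_fetch_object($result)))
                echo "\t", $o->val, "\n";
}

// Clean up the scratch table.
mysql_query("drop table if exists zzz_experidict");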

/*
Searched for čeří. Found:
        čeří
Searched for ČEří. Found:
        čeří
Searched for ceři. Found:
        ceři
Searched for CeřI. Found:
        ceři
Searched for ceri. Found:
        ceri
Searched for CeRi. Found:
        ceri
*/


-----Original Message-----
From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of David Lukes
Sent: 01 September 2016 17:38
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] UTF corpus and frequency list issue

Hi all,

 > This is a limitation in the collations provided by MySQL: ideally,
 > we’d want to be able to switch case-sensitivity and
 > diacritic-sensitivity independently, but the choice of collations you
 > get don’t afford that.

My two cents: it's possible to simulate case-insensitive,
diacritic-*sensitive* collation in MySQL by NFD normalizing strings
before passing them on to the database, in conjunction with
`utf8_general_ci` collation. See this gist for a demo (you need
python3):

<https://gist.github.com/dlukes/25467d658a5c5f53be0cfb55969e7dcd>

It might be a better default behavior, since in the context of
linguistics, one almost never (?) wants diacritic-insensitive
comparisons. Though of course I've no idea how much effort it would take
to incorporate this into the codebase, so it might not be worth the
hassle :)
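
For readers without a MySQL server to hand, the effect of the NFD trick described above can be approximated in plain Python. This is only an analogue under stated assumptions: `ci_key` and `find` are made-up names, and the comparison key (case-fold, then NFD) stands in for what `utf8_general_ci` is observed to do with NFD-normalized strings; it is not MySQL's collation code.

```python
import unicodedata


def ci_key(text: str) -> str:
    """Illustrative comparison key: ignore case, keep diacritics as combining marks."""
    return unicodedata.normalize("NFD", text.casefold())


# The same three test words and six search terms as in the thread's demo.
stored = ["čeří", "ceři", "ceri"]
searches = ["čeří", "ČEří", "ceři", "CeřI", "ceri", "CeRi"]


def find(term: str) -> list:
    """Return the stored words whose key equals the key of the search term."""
    return [w for w in stored if ci_key(w) == ci_key(term)]


for s in searches:
    print(f"Searched for {s}. Found: {find(s)}")
```

Each search term matches only the stored word with the same diacritics, regardless of case.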

Best,

David

---
David Lukeš
Institute of the Czech National Corpus
Faculty of Arts, Charles University
Prague, Czech Republic

_______________________________________________
CWB mailing list
CWB at liste.sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

