[CWB] Strange issue with character encoding (?) in frequency lists

Tue May 28 14:50:29 CEST 2019

>>> If I understood a recent discussion with Andrew correctly, newer versions of MySQL will provide finer control over the collation setting, but may not be widely available for the next 5–10 years (depending on how many people stick with some LTS Linux).

Yes, that's right. They are already available if you have MySQL 8, but they are not in 5.7, or in any MariaDB release to date.

I have been writing some code to allow CQPweb to work out dynamically what the best collation is out of those available on the server, depending on the corpus settings. This code will kick in when we shift the entire database to utf8mb4. It will maintain the old, wonky behaviour if only the old-style collations are available, but apply correct behaviour if it can.
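A minimal sketch of that kind of selection logic, for illustration only. This is not the CQPweb code: the function name, the preference order and the fallback choices are all assumptions. The collation names are ones that MySQL 8 ships (the utf8mb4_0900_* family) plus the older utf8mb4_general_ci, utf8mb4_unicode_ci and utf8mb4_bin.

```python
# Hypothetical helper, NOT taken from CQPweb: choose a collation name from
# those the server reports, given the corpus settings.

# Preferred collations per (case_sensitive, diacritic_sensitive) setting,
# best first. The ordering is an assumption made for this sketch.
PREFERENCES = {
    (True, True): ["utf8mb4_0900_as_cs", "utf8mb4_bin"],
    (False, True): ["utf8mb4_0900_as_ci"],
    (False, False): ["utf8mb4_0900_ai_ci", "utf8mb4_unicode_ci",
                     "utf8mb4_general_ci"],
    # (True, False) is deliberately absent: this sketch simply falls back
    # to the legacy case-sensitive choice for that combination.
}

# Old-style behaviour: case-insensitive also lumps diacritics together,
# case-sensitive keeps everything apart. Assumed to exist on any server
# that supports utf8mb4.
LEGACY = {True: "utf8mb4_bin", False: "utf8mb4_general_ci"}


def pick_collation(available, case_sensitive, diacritic_sensitive=True):
    """Return the best matching collation name, or the legacy fallback."""
    available = set(available)
    for name in PREFERENCES.get((case_sensitive, diacritic_sensitive), []):
        if name in available:
            return name
    return LEGACY[case_sensitive]
```

In practice the list of available names would come from the server (for example from SHOW COLLATION); here it is just passed in as a list of strings.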

Incidentally - so far case sensitivity (and, when I add it, diacritic sensitivity) is set at the level of the corpus. Should it be set at the level of each different annotation (p-attribute)?

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Stefan Evert
Sent: 28 May 2019 07:38
To: Scott Sadowsky <ssadowsky at gmail.com>
Cc: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Strange issue with character encoding (?) in frequency lists

> On 28 May 2019, at 06:16, Scott Sadowsky <ssadowsky at gmail.com> wrote:
> 
> I understand that behavior occurring when the various forms are treated as equivalents (e.g. "naive" and "naïve" with %cd). But in my case, "mi" and "mí" actually correspond to different lemmas ("mi" and "mí", respectively). Would this still be the expected behavior in this case?

Yes.  It's not the desired behaviour, but a quirk of MySQL that we cannot work around.  As soon as you use case-insensitive collation, it will also lump all the diacritics together.  (If you feel that this is extremely bad software design, I don't think either of us would disagree in the least.)

Note that %cd in a query is evaluated by CWB, whereas the aggregation of different forms in a frequency count and in the collocation database is carried out by MySQL based on its collation setting.  The two won't always agree.

> If so, does this mean that lemmas that differ only by diacritical marks are treated as one single lemma by one or more parts of CWB/CQPweb? 

CWB never treats different case/diacritic variants as the same form (whether word form or lemma), it just allows you to search insensitively with the %c and %d flags.

In the frequency aggregation, MySQL lumps the different forms together into a single lemma (or normalized word form).
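To make the effect concrete, here is a rough Python imitation of that aggregation. It only approximates what a case-insensitive collation does (by stripping combining marks and case-folding), it is not how MySQL implements collations, and the counts are invented for the example.

```python
import unicodedata
from collections import Counter


def fold(form, case_sensitive):
    """Approximate the grouping key a collation would assign to a form."""
    if case_sensitive:
        # Case-sensitive: every distinct form stays distinct.
        return form
    # Case-insensitive (old-style): diacritics are lumped together as well.
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()


def aggregate(counts, case_sensitive):
    """Sum the frequencies of all forms that share a grouping key."""
    total = Counter()
    for form, n in counts.items():
        total[fold(form, case_sensitive)] += n
    return dict(total)


# Invented counts, for illustration only.
lemma_counts = {"mi": 5, "mí": 3, "Mi": 2}

merged = aggregate(lemma_counts, case_sensitive=False)
separate = aggregate(lemma_counts, case_sensitive=True)
```

With case_sensitive=False the three entries collapse into a single "mi" with the summed count; with case_sensitive=True they stay apart, which also keeps "mi" and "Mi" apart.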

If I understood a recent discussion with Andrew correctly, newer versions of MySQL will provide finer control over the collation setting, but may not be widely available for the next 5–10 years (depending on how many people stick with some LTS Linux).

> Okay. Here are the results:
> 	• Your query “[lemma="que"]” returned 314,658 matches in 4,701 different texts
> 	• Your query “[lemma="qúe"]” returned 4 matches in 4 different texts
> 	• Your query “[lemma="qué"]” returned 12,924 matches in 2,723 different texts
> 	• (“[lemma="qùe"]” has 0 results)
> 1 and 3 are both actual words in Spanish, so both forms are expected.

That is unfortunate, because they will be conflated into a single lemma by MySQL.  What we do for German corpora is to turn on case-sensitivity in CQPweb (keep in mind that you will have to re-build all frequency tables afterwards).  This also makes frequency counts sensitive to diacritics.

The downside is that collocations and frequency counts for word forms become even more useless. However, lemma-based analyses work well provided that you have a good lemmatizer (which does case-normalization where appropriate).

Best,
Stefan
