[CWB] Strange issue with character encoding (?) in frequency lists

Tue May 28 08:38:27 CEST 2019

> On 28 May 2019, at 06:16, Scott Sadowsky <ssadowsky at gmail.com> wrote:
> 
> I understand that behavior occurring when the various forms are treated as equivalents (e.g. "naive" and "naïve" with %cd). But in my case, "mi" and "mí" actually correspond to different lemmas ("mi" and "mí", respectively). Would this still be the expected behavior in this case?

Yes.  It's not the desired behaviour, but a quirk of MySQL that we cannot work around.  As soon as you use case-insensitive collation, it will also lump all the diacritics together.  (If you feel that this is extremely bad software design, I don't think either of us would disagree in the least.)

Note that %cd in a query is evaluated by CWB, whereas the aggregation of different forms in a frequency count and in the collocation database is carried out by MySQL based on its collation setting.  The two won't always agree.

> If so, does this mean that lemmas that differ only by diacritical marks are treated as one single lemma by one or more parts of CWB/CQPweb? 

CWB never treats different case/diacritic variants as the same form (whether word form or lemma); it just allows you to search insensitively with the %c and %d flags.

In the frequency aggregation, MySQL lumps the different forms together into a single lemma (or normalized word form).
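
For illustration only, here is a minimal Python sketch of that lumping effect. It uses SQLite as a stand-in for MySQL, and the FOLDED collation, the fold() helper and the toy token list are all made up for this example; MySQL's real collation rules are more elaborate, so take it as an approximation of the mechanism rather than of the actual implementation.

```python
import sqlite3
import unicodedata


def fold(text: str) -> str:
    """Strip combining marks and case: a rough stand-in for an accent- and
    case-insensitive collation (an assumption, not MySQL's actual rules)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def collate_folded(a: str, b: str) -> int:
    """Three-way comparison on the folded keys, as sqlite3 expects."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)


def grouped_counts(tokens, collation):
    """Return the sorted group sizes of GROUP BY lemma under a collation.

    `collation` is either SQLite's built-in BINARY or the hypothetical
    FOLDED collation registered below.
    """
    con = sqlite3.connect(":memory:")
    con.create_collation("FOLDED", collate_folded)
    con.execute("CREATE TABLE tokens (lemma TEXT)")
    con.executemany("INSERT INTO tokens VALUES (?)", [(t,) for t in tokens])
    rows = con.execute(
        f"SELECT COUNT(*) FROM tokens GROUP BY lemma COLLATE {collation}"
    ).fetchall()
    con.close()
    return sorted(n for (n,) in rows)


# Toy data, invented for this sketch.
toy = ["mi", "mí", "mi", "Mi", "mí"]
folded_groups = grouped_counts(toy, "FOLDED")  # everything lands in one group
binary_groups = grouped_counts(toy, "BINARY")  # "mi", "mí" and "Mi" stay apart
```

The stored strings are never changed in either case; only the comparison used for grouping differs, which is why a search in CWB and a frequency table can disagree about what counts as "the same" lemma.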

If I understood a recent discussion with Andrew correctly, newer versions of MySQL will provide finer control over the collation setting, but may not be widely available for the next 5–10 years (depending on how many people stick with some LTS Linux).

> Okay. Here are the results:
> 	• Your query “[lemma="que"]” returned 314,658 matches in 4,701 different texts
> 	• Your query “[lemma="qúe"]” returned 4 matches in 4 different texts
> 	• Your query “[lemma="qué"]” returned 12,924 matches in 2,723 different texts
> 	• (“[lemma="qùe"]” has 0 results)
> 1 and 3 are both actual words in Spanish, so both forms are expected.

That is unfortunate, because they will be conflated into a single lemma by MySQL.  What we do for German corpora is to turn on case-sensitivity in CQPweb (keep in mind that you will have to re-build all frequency tables afterwards).  This also makes frequency counts sensitive to diacritics.
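
To put numbers on it, here is a rough Python sketch that aggregates the three match counts you reported in two ways: once under a folding key that ignores case and diacritics, and once under exact comparison. The helper names are made up for this example, and the folding is only an approximation of what a case-insensitive MySQL collation does.

```python
from collections import Counter
import unicodedata

# Match counts taken from the three CQP queries quoted above.
REPORTED = {"que": 314_658, "qúe": 4, "qué": 12_924}


def strip_marks_and_case(form: str) -> str:
    """Approximate an accent- and case-insensitive comparison key
    (hypothetical helper, not CQPweb or MySQL code)."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()


def aggregate(counts, key=lambda form: form):
    """Sum frequencies of all forms that share the same comparison key."""
    totals = Counter()
    for form, n in counts.items():
        totals[key(form)] += n
    return dict(totals)


insensitive = aggregate(REPORTED, key=strip_marks_and_case)  # one merged entry
sensitive = aggregate(REPORTED)                              # three entries
```

Under the folding key all 327,586 matches end up in a single "que" entry, whereas exact comparison keeps the three lemmas and their individual counts apart.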

The downside is that collocations and frequency counts for word forms become even more useless. However, lemma-based analyses work well provided that you have a good lemmatizer (which does case-normalization where appropriate).

Best,
Stefan