[CWB] UTF corpus and frequency list issue

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Aug 17 18:32:12 CEST 2016


Hi Thilo,

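By default, frequency list tables use the MySQL case-insensitive collation. Unfortunately, this collation is also accent-insensitive, so forms such as "sagte" and "sägte" are treated as the same item when the frequencies are counted.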

You can switch to case-sensitive mode in “Corpus settings”, but please take note of the warning in the UI: “if you change this setting, you must delete and recreate all frequency lists and delete cached databases”. Case-sensitive mode is also accent-sensitive.

This is a limitation in the collations provided by MySQL: ideally, we’d want to be able to switch case-sensitivity and diacritic-sensitivity independently, but the choice of collations you get doesn’t afford that.
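
To illustrate the effect outside of MySQL, here is a minimal Python sketch. It is not the CQPweb or MySQL implementation: `fold_key` and `frequency_list` are hypothetical helpers, the folding only approximates what a case- and accent-insensitive collation does, and the choice to label a merged row with the first form seen is an assumption made purely for the demonstration.

```python
import unicodedata
from collections import Counter


def fold_key(token: str) -> str:
    """Hypothetical stand-in for a case- and accent-insensitive collation key."""
    # Decompose characters such as "ä" into "a" + combining diaeresis ...
    decomposed = unicodedata.normalize("NFD", token)
    # ... drop the combining marks, then fold case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def frequency_list(tokens, key=lambda t: t):
    """Count tokens grouped by key(token); the default key is fully sensitive."""
    counts = Counter()
    label = {}
    for tok in tokens:
        k = key(tok)
        counts[k] += 1
        # Assumption: the first surface form seen labels the merged row.
        label.setdefault(k, tok)
    return {label[k]: n for k, n in counts.items()}


tokens = ["sägte", "sagte", "sagte", "Sagte", "bé", "be"]

# Insensitive grouping merges "sägte" with every "sagte"/"Sagte".
insensitive = frequency_list(tokens, key=fold_key)

# Sensitive grouping keeps every distinct surface form apart.
sensitive = frequency_list(tokens)

print(insensitive)
print(sensitive)
```

With the insensitive key, the single "sägte" absorbs the counts of "sagte" and "Sagte", which is the kind of inflated figure described in the original message; with the default key, each form keeps its own count.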

best

Andrew

From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Thilo Wiertz
Sent: 17 August 2016 15:13
To: Open source development of the Corpus WorkBench
Subject: [CWB] UTF corpus and frequency list issue

On another issue (that most likely has nothing to do with my previous mail):

I am using a UTF-encoded corpus that contains German characters such as ä, ö, ü, ß. While this does not cause trouble in standard queries, for example, there seems to be an issue in frequency lists. Just one example: in a newspaper corpus, the term "sägte" (as in "he sawed") is listed near the top with a frequency of 90,000. Clicking on the word gives the correct number of occurrences of the term, which is only one.

The calculation is confusing the term "sagte" (said) with "sägte" ("ä" instead of "a"). Similar cases occur, as the frequency list somehow seems to favour awkward but rare spellings: for instance, the "bé" from a football transcription "Ri - bé - ry" links to the search results for "be" ;-).

Best wishes,
Thilo
