[CWB] UTF corpus and frequency list issue

Thilo Wiertz thilo.wiertz at geographie.uni-freiburg.de
Wed Aug 17 16:12:50 CEST 2016


On another issue (that most likely has nothing to do with my previous mail):

I am using a UTF-encoded corpus that contains German characters such as ä, ö, ü, ß. While this does not cause trouble in standard queries, for example, there seems to be an issue in frequency lists. Just one example: in a newspaper corpus, the term "sägte" (as in he "sawed") is listed near the top with a frequency of 90,000. Actually clicking on the word gives the correct number of occurrences of the term, which is only one.

The calculation appears to conflate the term "sagte" (said) with "sägte" ("a" instead of the "ä"). Similar examples occur, as the frequency list somehow seems to prefer awkward but rare spellings, so that a football transcription Ri - "bé" - ry links to the search result for "be" ;-).
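
To make the symptom concrete, here is a minimal Python sketch of how such a merge could arise. It assumes (this is not confirmed for my setup) that the frequency list groups tokens under an accent-insensitive key, while the query matches the exact string. The helper fold_accents and the toy token list are hypothetical and only serve as illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def fold_accents(word):
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy data standing in for the corpus: many "sagte", a single "sägte".
tokens = ["sagte"] * 5 + ["sägte"] + ["be"] * 3 + ["bé"]

# Exact counting keeps the two spellings apart (what the query does).
exact = Counter(tokens)

# Accent-insensitive counting merges them (the assumed frequency-list behaviour).
folded = Counter(fold_accents(t) for t in tokens)

# Collect which surface forms end up sharing one folded key.
variants = defaultdict(set)
for t in tokens:
    variants[fold_accents(t)].add(t)

print(exact["sägte"], folded["sagte"], sorted(variants["sagte"]))
```

If the merged group is then displayed under the rarer spelling, the list would show "sägte" with the combined count, while an exact query for "sägte" would still return a single hit.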

Best wishes,
Thilo
