[CWB] UTF8 Bug

Ruprecht von Waldenfels waldenfels at issl.unibe.ch
Thu Jul 28 10:11:40 CEST 2011


Dear everybody,

it seems that the latest version of CWB does not handle UTF-8 properly 
when calculating the context size. If the context size is defined as 25 
characters, illegal characters are output in some cases, presumably 
because the context is truncated in the middle of a multi-byte character.



This is a real problem if one works with XML, since the output is then 
no longer valid XML. However, the problem is easy to avoid by choosing a 
different context measure.
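
To illustrate the presumed cause, here is a minimal Python sketch. It is 
only an assumption about the mechanism, not CWB's actual code, and the 
function names are hypothetical. It shows how cutting a UTF-8 string at a 
fixed byte offset can split a character, and how backing up to the 
previous character boundary avoids this.

```python
def truncate_bytes_naive(text: str, max_bytes: int) -> bytes:
    """Cut the UTF-8 encoding at a fixed byte offset (may split a character)."""
    return text.encode("utf-8")[:max_bytes]


def truncate_bytes_safe(text: str, max_bytes: int) -> bytes:
    """Cut at the last character boundary at or before max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # UTF-8 continuation bytes look like 10xxxxxx. If the byte just after
    # the cut is a continuation byte, the cut falls inside a character,
    # so move the cut back until it sits in front of a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Each Cyrillic letter here takes two bytes, so a 5-byte cut lands
# in the middle of the third letter.
sample = "писанїє токмо"
naive = truncate_bytes_naive(sample, 5)
safe = truncate_bytes_safe(sample, 5)
```

With the sample word above, the naive cut yields a byte sequence that no 
longer decodes as UTF-8, while the boundary-aware cut keeps only the two 
complete letters that fit.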

I did not know how to submit this as a bug.

All the best!
Ruprecht

PS: a sample corpus (hope this makes it through the web servers!):
<line nr="7">
растерѕаѧ
писмѧ
.
и
покаꙁоуѧ
</line>
<line nr="8">
.
ꙗко
не
поⷣбаєⷮ
прїємати
</line>
<line nr="9">
писанїє
токмо
</line>
<line nr="10">
закона
.
но
въ
самоⷨ
писани
,
</line>


-- 
------------------------------------------------
Ruprecht von Waldenfels
Universitaet Bern
Institut fuer slavische Sprachen und Literaturen
Laenggassstrasse 49 - CH 3005 Bern 9
------------------------------------------------
Tel: +41  31 631 35 83 /  Fax: +41 31  631 39 90
Tel: +49 761 214 66 72 / Mob.: +49 163 230 34 23
------------------------------------------------


