[CWB] UTF8 Bug
Ruprecht von Waldenfels
waldenfels at issl.unibe.ch
Thu Jul 28 10:11:40 CEST 2011
Dear everybody,
it seems the latest version of CWB does not handle UTF-8 properly when
it calculates the context size. If the context size is set to 25
characters, the output in some cases contains illegal characters,
presumably because a multi-byte character is cut off in the middle.
This is a real problem if one works with XML, since the output is then
no longer valid XML. However, the problem is easy to avoid by choosing a
different context measure.
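
To illustrate the suspected cause: the sketch below is not CWB code, only a
minimal Python illustration under the assumption that the context is cut at a
fixed byte count. The function names truncate_bytes and truncate_utf8_safe are
hypothetical. Cutting a UTF-8 byte string at an arbitrary offset can split a
two-byte Cyrillic character, whereas backing up to a character boundary keeps
the output valid.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Naive truncation: cut the UTF-8 encoding at a fixed byte count."""
    return text.encode("utf-8")[:limit]


def truncate_utf8_safe(text: str, limit: int) -> bytes:
    """Keep at most `limit` bytes, but never split a multi-byte character."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return data
    end = limit
    # UTF-8 continuation bytes look like 10xxxxxx. If the first excluded
    # byte is a continuation byte, the cut falls inside a character, so
    # move the cut back to the start of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    word = "писмѧ"  # token from the sample corpus; 2 bytes per character
    naive = truncate_bytes(word, 5)
    try:
        naive.decode("utf-8")
    except UnicodeDecodeError as err:
        print("naive cut is not valid UTF-8:", err)
    print(truncate_utf8_safe(word, 5).decode("utf-8"))
```

With a limit of 5 bytes, the naive cut ends in the middle of the third
character, while the boundary-aware version returns only the first two
complete characters.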
I did not know how to submit this as a bug.
All the best!
Ruprecht
PS: a sample corpus (hope this makes it through the web servers!):
<line nr="7">
растерѕаѧ
писмѧ
.
и
покаꙁоуѧ
</line>
<line nr="8">
.
ꙗко
не
поⷣбаєⷮ
прїємати
</line>
<line nr="9">
писанїє
токмо
</line>
<line nr="10">
закона
.
но
въ
самоⷨ
писани
,
</line>
--
------------------------------------------------
Ruprecht von Waldenfels
Universitaet Bern
Institut fuer slavische Sprachen und Literaturen
Laenggassstrasse 49 - CH 3005 Bern 9
------------------------------------------------
Tel: +41 31 631 35 83 / Fax: +41 31 631 39 90
Tel: +49 761 214 66 72 / Mob.: +49 163 230 34 23
------------------------------------------------