[CWB] UTF8 Bug
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon Aug 1 01:33:24 CEST 2011
Hi Ruprecht,
We're aware of this bug -
https://sourceforge.net/tracker/?func=detail&aid=3046107&group_id=131809&atid=722303
The plan is to insert a quick fix ASAP that will prevent malformed byte sequences from being output, and then sort it out properly for v4.0.
Thanks for the sample corpus - useful!
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
Sent: 28 July 2011 09:12
To: cwb at sslmit.unibo.it
Subject: [CWB] UTF8 Bug
Dear everybody,
it seems the latest version of CWB does not deal properly with UTF-8 when
it calculates the context size. If the context size is defined as 25
characters, illegal characters are output in some cases, presumably
because of truncation.
This is a real problem if one works with XML, since the output is then
no longer valid XML. However, the problem is easy to avoid by choosing a
different context measure.
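
For illustration, here is a minimal Python sketch of the suspected failure
mode. It is not CWB code. It assumes the context limit is applied to bytes
rather than to characters, which the report only presumes. The function name
truncate_utf8 is hypothetical. The sketch backs up from the cut point until
it no longer falls inside a multi-byte character.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character.

    Hypothetical helper for illustration only; not part of CWB.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the byte just
    # after the cut is a continuation byte, the cut splits a character, so
    # move the cut back until it sits in front of a lead (or ASCII) byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# A word from the sample corpus: five characters, two bytes each in UTF-8.
word = "писмѧ".encode("utf-8")

# A naive byte cut at 5 splits the third character and is not valid UTF-8.
try:
    word[:5].decode("utf-8")
    naive_cut_is_valid = True
except UnicodeDecodeError:
    naive_cut_is_valid = False

# The boundary-aware cut keeps only whole characters.
safe_text = truncate_utf8(word, 5).decode("utf-8")
```

With the boundary-aware cut, the result always decodes cleanly, at the cost
of returning slightly fewer bytes than the requested limit.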
I did not know how to submit this as a bug.
All the best!
Ruprecht
PS: a sample corpus (hope this makes it through the web servers!):
<line nr="7">
растерѕаѧ
писмѧ
.
и
покаꙁоуѧ
</line>
<line nr="8">
.
ꙗко
не
поⷣбаєⷮ
прїємати
</line>
<line nr="9">
писанїє
токмо
</line>
<line nr="10">
закона
.
но
въ
самоⷨ
писани
,
</line>
--
------------------------------------------------
Ruprecht von Waldenfels
Universitaet Bern
Institut fuer slavische Sprachen und Literaturen
Laenggassstrasse 49 - CH 3005 Bern 9
------------------------------------------------
Tel: +41 31 631 35 83 / Fax: +41 31 631 39 90
Tel: +49 761 214 66 72 / Mob.: +49 163 230 34 23
------------------------------------------------
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb