[CWB] UTF8 Bug

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Aug 1 01:33:24 CEST 2011


Hi Ruprecht,

We're aware of this bug - 
https://sourceforge.net/tracker/?func=detail&aid=3046107&group_id=131809&atid=722303

The plan is to insert a quick fix ASAP that will prevent malformed byte sequences from being output, and then sort it out properly for v4.0.

Thanks for the sample corpus - useful!

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
Sent: 28 July 2011 09:12
To: cwb at sslmit.unibo.it
Subject: [CWB] UTF8 Bug

Dear everybody,

it seems the latest version of CWB does not deal properly with UTF-8 
when calculating the context size. If the context size is defined as 
25 characters, illegal characters are output in some cases, presumably 
because of truncation.



This is a real problem when one works with XML, since the output is then 
no longer valid XML. However, the problem is easy to avoid by choosing a 
different context measure.
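
To illustrate the presumed cause, here is a minimal Python sketch. It is 
not CWB's actual code; the function names are hypothetical, and it assumes 
that the context is cut after a fixed number of bytes rather than 
characters. The first function reproduces the broken output, the second 
backs up to the nearest character boundary before cutting.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Naive cut: keep the first `limit` bytes of the UTF-8 encoding.

    Hypothetical stand-in for the suspected behaviour; may split a
    multi-byte character and leave a malformed sequence at the end.
    """
    return text.encode("utf-8")[:limit]


def truncate_utf8_safe(text: str, limit: int) -> bytes:
    """Keep at most `limit` bytes, never splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return data
    end = limit
    # UTF-8 continuation bytes look like 10xxxxxx. If the first byte we
    # are about to drop is one of them, the cut falls inside a character,
    # so step back until the cut sits on a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    word = "писанїє"  # from the sample corpus; 2 bytes per letter in UTF-8
    broken = truncate_bytes(word, 5)
    try:
        broken.decode("utf-8")
    except UnicodeDecodeError as err:
        print("naive cut is not valid UTF-8:", err.reason)
    print(truncate_utf8_safe(word, 5).decode("utf-8"))
```

With a limit of 5 bytes the naive cut ends in the middle of the third 
letter, so the result cannot be decoded, while the safe variant returns 
only the first two complete letters.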

I did not know how to submit this as a bug.

All the best!
Ruprecht

PS: a sample corpus (hope this makes it through the web servers!):
<line nr="7">
растерѕаѧ
писмѧ
.
и
покаꙁоуѧ
</line>
<line nr="8">
.
ꙗко
не
поⷣбаєⷮ
прїємати
</line>
<line nr="9">
писанїє
токмо
</line>
<line nr="10">
закона
.
но
въ
самоⷨ
писани
,
</line>


-- 
------------------------------------------------
Ruprecht von Waldenfels
Universitaet Bern
Institut fuer slavische Sprachen und Literaturen
Laenggassstrasse 49 - CH 3005 Bern 9
------------------------------------------------
Tel: +41  31 631 35 83 /  Fax: +41 31  631 39 90
Tel: +49 761 214 66 72 / Mob.: +49 163 230 34 23
------------------------------------------------

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

