[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Aug 6 13:34:14 CEST 2014


Hi Enrique,

The code points from U+FDD0 to U+FDEF are noncharacters reserved for "process-internal use": that is, whatever program is introducing these in its output is Doing Things Wrong and acting against an explicit mandate of the Unicode standard. So, the right thing here is either to remove those characters, or (better yet) fix whatever rogue program is introducing them.

The last two characters on your list are U+FFFE and U+FFFF, which are guaranteed noncharacters (often handled as wrong-endian BOM and error code). For them to be appearing in your data files is very bad news indeed....
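
If you do go the removal route, a minimal Python sketch could look like the following. The function names are purely illustrative (nothing here is part of CWB), and the check is written to cover all Unicode noncharacters, i.e. U+FDD0..U+FDEF plus the last two code points of every plane, not only the six on your list.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point is one of Unicode's noncharacters."""
    # U+FDD0..U+FDEF: the contiguous block of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+xxFFFE and U+xxFFFF: the last two code points of each plane.
    return (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(text: str) -> str:
    """Drop every noncharacter from an already decoded string."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))
```

Applied line by line to the input before it reaches cwb-encode, this only removes the offending code points; it does not tell you where they came from, which is still worth finding out.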

There is no way to override CWB's insistence on UTF-8 data being well formed (other than encoding the text as some other character set, which causes other problems), and it is not something you should want to do anyway.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Enrique Manjavacas
Sent: 06 August 2014 10:02
To: cwb at sslmit.unibo.it
Subject: [CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Hi,

I have an issue when trying to encode a corpus with the utf8 charset option in version 3.4.7. The encoding process aborts with the message:

Encoding error: an invalid byte or byte sequence for charset "utf8" was encountered.

However, the documents have been carefully checked for utf8 well-formedness.
Inspecting the files has shown that the problematic codepoints are the noncharacters:

ef b7 90
ef b7 93
ef b7 a1
ef b7 af
ef bf be
ef bf bf

and filtering them before encoding resolves the issue, but still I was wondering whether there is some way of getting cwb-encode to accept such input.
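
For reference, the byte sequences listed above can be mapped to their code points with a short Python sketch. The helper name is illustrative only; the sketch relies on Python's own UTF-8 decoder, which accepts noncharacters.

```python
SEQUENCES = ["ef b7 90", "ef b7 93", "ef b7 a1", "ef b7 af", "ef bf be", "ef bf bf"]


def code_point(hex_bytes: str) -> str:
    """Decode one space-separated hex byte sequence and name its code point."""
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    return f"U+{ord(ch):04X}"


for seq in SEQUENCES:
    print(seq, "->", code_point(seq))
```

This prints U+FDD0, U+FDD3, U+FDE1, U+FDEF, U+FFFE and U+FFFF respectively.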

thanks!
Enrique


