[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Wed Aug 6 11:02:15 CEST 2014

Hi,

I have an issue when trying to encode a corpus with the utf8 charset option in version 3.4.7.
The encoding process aborts with the message:

Encoding error: an invalid byte or byte sequence for charset "utf8" was encountered.

However, the documents have been carefully checked for UTF-8 well-formedness.
Inspecting the files has shown that the problematic code points are the following noncharacters:

ef b7 90  (U+FDD0)
ef b7 93  (U+FDD3)
ef b7 a1  (U+FDE1)
ef b7 af  (U+FDEF)
ef bf be  (U+FFFE)
ef bf bf  (U+FFFF)

Filtering them out before encoding resolves the issue, but I was still wondering whether there
is some way of getting cwb-encode to accept such input.
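
For reference, the filtering workaround can be sketched roughly as follows. This is only a minimal illustration, assuming Python 3 and that all Unicode noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane) should be dropped; the name strip_noncharacters is hypothetical and not part of CWB.

```python
import re
import sys

# Unicode noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF
# for each of the 17 planes (n = 0x0 .. 0x10).
_NONCHAR_RANGES = [(0xFDD0, 0xFDEF)] + [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF) for plane in range(17)
]

# One character class covering every noncharacter range.
NONCHAR_RE = re.compile(
    "[" + "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in _NONCHAR_RANGES) + "]"
)


def strip_noncharacters(text: str) -> str:
    """Return text with all Unicode noncharacters removed (hypothetical helper)."""
    return NONCHAR_RE.sub("", text)


if __name__ == "__main__":
    # Read raw bytes so decoding is explicit; input is assumed to be well-formed UTF-8.
    data = sys.stdin.buffer.read().decode("utf-8")
    sys.stdout.buffer.write(strip_noncharacters(data).encode("utf-8"))
```

Saved under a name such as strip_nonchars.py (hypothetical), it would be run over each input file as `python3 strip_nonchars.py < input > output` before passing the result to cwb-encode.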

Thanks!
Enrique