[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Stefan Evert stefanML at collocations.de
Wed Aug 6 18:58:41 CEST 2014


> However, looking back and mulling it over, I think I may now have thought of a way to get cleanup to work by incrementally overwriting invalid bytes with "?" and then revalidating. That would mean you'd get more than one "?" for a multi-byte bad character, but that is not necessarily a problem (it is invalid data, so how many characters it "really" represents is undefined).
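
For illustration only, the overwrite-and-revalidate idea quoted above could be sketched roughly as follows. This is not CWB code: the function name clean_utf8 is hypothetical, and it relies on Python's strict UTF-8 decoder as the validator, which may not match cwb-encode's own checks. In particular, Python accepts well-formed noncharacters such as U+FFFF, so this sketch only deals with malformed byte sequences.

```python
def clean_utf8(data: bytes, repl: int = ord("?")) -> bytes:
    """Overwrite each invalid byte with `repl`, revalidating after every fix.

    Hypothetical helper, not part of CWB. Validation is delegated to
    Python's strict UTF-8 decoder (an assumption of this sketch).
    """
    buf = bytearray(data)
    pos = 0  # everything before `pos` is already known to be valid
    while pos < len(buf):
        try:
            buf[pos:].decode("utf-8")
            break  # the remainder is valid, nothing left to fix
        except UnicodeDecodeError as err:
            bad = pos + err.start   # first offending byte in the buffer
            buf[bad] = repl         # overwrite just that one byte
            pos = bad + 1           # revalidate from the next byte onwards
    return bytes(buf)


if __name__ == "__main__":
    # A stray Latin-1 byte inside otherwise plain ASCII text.
    print(clean_utf8(b"caf\xe9 au lait"))
    # A truncated multi-byte sequence yields one "?" per bad byte.
    print(clean_utf8(b"\xe2\x82"))
```

Because only one byte is overwritten per step, a bad multi-byte sequence ends up as several "?" characters, as anticipated in the quoted message.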

It would be great to have -C work for UTF-8, too!  I've repeatedly had substantial problems importing Web corpora into CWB because they are supposed to be UTF-8 but contain invalid bytes (from Web pages whose encoding wasn't recognized properly or which contain data in different encodings).  And it's a pain to run 9 billion words of text through an external validator to remove or re-encode the offending bits ...

(Fortunately, this was English data, so I simply encoded it as ASCII with -C ... but still a bummer.)

Following a recent discussion on the SQLite mailing list, perhaps we should replace invalid codepoints with random bytes instead of "?", in order to make corpus admins more aware of the fact that bad and unpredictable things happen if you work on invalid data!

Cheers,
Stefan

