[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Yannick Versley yversley at gmail.com
Wed Aug 6 21:01:01 CEST 2014


>
> > However, looking back and mulling it over, I think I may now have
> thought of a way to get cleanup to work by incrementally overwriting
> invalid bytes with "?" and then revalidating. That would mean you'd get
> more than one "?" for a multi-byte bad character, but that is not
> necessarily a problem (it is invalid data, so how many characters it
> "really" represents is undefined).
>
UTF-8 clearly specifies how codepoints are to be encoded:
http://en.wikipedia.org/wiki/UTF-8
As such, a workable solution could be:
- replace each of the "red" bytes C0/C1/F5..FF (which can never occur in
  valid UTF-8) by a single question mark
- replace each invalid codepoint by a single question mark
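
The two rules above could be sketched roughly as follows. This is only an
illustration in Python, not CWB code: the names clean_utf8 and
is_invalid_codepoint and the reject_noncharacters switch are made up for
this example. It also assumes that stray continuation bytes and truncated
sequences are handled like the "red" bytes (one "?" per byte), and that
overlong forms, surrogates and values above U+10FFFF count as invalid
codepoints. Whether noncharacters count as invalid is a policy choice.

```python
def is_invalid_codepoint(cp: int, length: int, reject_noncharacters: bool = True) -> bool:
    """Decide whether a structurally well-formed sequence encodes a bad codepoint."""
    smallest = {2: 0x80, 3: 0x800, 4: 0x10000}
    if cp < smallest[length]:
        return True                      # overlong encoding
    if 0xD800 <= cp <= 0xDFFF:
        return True                      # UTF-16 surrogate
    if cp > 0x10FFFF:
        return True                      # beyond the Unicode range
    if reject_noncharacters:
        # U+FDD0..U+FDEF and the last two codepoints of every plane
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return False


def clean_utf8(data: bytes, reject_noncharacters: bool = True) -> bytes:
    """Return data with invalid bytes and invalid codepoints replaced by '?'."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                     # plain ASCII
            out.append(b)
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            # C0, C1, F5..FF, or a stray continuation byte 80..BF
            out += b"?"
            i += 1
            continue
        tail = data[i + 1:i + length]
        if len(tail) < length - 1 or any((t & 0xC0) != 0x80 for t in tail):
            # truncated or broken sequence: replace the lead byte only
            out += b"?"
            i += 1
            continue
        cp = b & (0xFF >> (length + 1))  # payload bits of the lead byte
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if is_invalid_codepoint(cp, length, reject_noncharacters):
            out += b"?"                  # one "?" for the whole sequence
        else:
            out += data[i:i + length]
        i += length
    return bytes(out)
```

With this sketch, an encoded surrogate such as ED A0 80 comes out as a
single "?", while C0 80 comes out as two, because neither byte can start
or continue a valid sequence at that position.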

> Following a recent discussion on the SQLite mailing list, perhaps we should
> replace invalid codepoints with random bytes instead of "?", in order to
> make corpus admins more aware of the fact that bad and unpredictable things
> happen if you work on invalid data!
>
To me, this seems straight out of the "poke the user in the eye" school of
usability. Even something blatant like "[INVALID BYTE REMOVED]" and
"[INVALID CODEPOINT REMOVED]" would make these cases easy to detect. I think
that making unpredictable and bad things happen will not make corpus admins
any more likely to have valid data in the first place (especially when that
data is pulled from random webpages); it will just give them yet another
unpredictable and bad problem that pops up randomly, especially if, as
Roland pointed out, Unicode libraries differ in their definition of "valid
codepoint".
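
To illustrate how a visible marker makes bad input easy to find, here is a
small Python sketch. It is not how cwb-encode works: the handler name
"visible-marker" and the function decode_with_markers are invented for this
example, and what counts as an invalid byte is whatever Python's own UTF-8
decoder rejects, which may differ from other libraries.

```python
import codecs

MARKER = "[INVALID BYTE REMOVED]"


def _marker_handler(err):
    """Codec error handler: emit one marker per offending byte, then resume."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return (MARKER * (err.end - err.start), err.end)


# Register the handler under a hypothetical name chosen for this example.
codecs.register_error("visible-marker", _marker_handler)


def decode_with_markers(data: bytes):
    """Decode UTF-8 and return (text, number of markers found in the text)."""
    text = data.decode("utf-8", errors="visible-marker")
    # Note: a literal occurrence of MARKER in the input would be counted too.
    return text, text.count(MARKER)
```

A corpus admin could then grep the encoded data for the marker string, or
refuse to index a file whose marker count is above zero.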

Best wishes,
Yannick