[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Stefan Evert stefanML at collocations.de
Wed Aug 6 21:28:26 CEST 2014


>> Following a recent discussion on the SQLite mailing list, perhaps we should replace invalid codepoints with random bytes instead of "?", in order to make corpus admins more aware of the fact that bad and unpredictable things happen if you work on invalid data!
> 
> To me, this seems straight out of the "poke the user in the eye" school of usability.

Hey, how did you find out that I'm a proud alumnus of this school?

> Even something blatant like "[INVALID BYTE REMOVED]" and "[INVALID CODEPOINT REMOVED]" would
> make these cases easy to detect. I think that making unpredictable and bad things happen will not make corpus admins
> any more likely to have valid data in the first place (especially when that data is pulled from random webpages), but give
> them yet another unpredictable and bad problem that pops up randomly

Admittedly, inserting random stuff isn't really appropriate in this particular case.  The suggestions for SQLite had to do with a case where the standard specified "undefined behaviour" – any deterministic solution would thus violate the standard by not being undefined. :-)

> , especially if, as Roland pointed out, Unicode libraries
> differ in their definition of "valid codepoint"

I think that in our case the definition of "valid codepoint" is quite unambiguous: "any codepoint that doesn't make the libraries we use – in particular GLib and PCRE – choke and crash CQP".  Sticking with GLib's definition of well-formedness seems a wise move.
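
For illustration only, here is a minimal Python sketch of the deterministic "replace with ?" approach mentioned above. It is not what cwb-encode or GLib actually does: the function names are hypothetical, and the assumption is that both malformed bytes and Unicode noncharacters (U+FDD0..U+FDEF and the last two codepoints of every plane) should be treated as invalid.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def sanitize(raw: bytes, placeholder: str = "?") -> str:
    """Decode UTF-8 input, replacing anything suspicious with a placeholder.

    Hypothetical helper, not part of CWB. Malformed byte sequences are
    first turned into U+FFFD by the decoder; U+FFFD and noncharacters are
    then mapped to the placeholder. Note that a genuine U+FFFD in the
    input is replaced as well.
    """
    text = raw.decode("utf-8", errors="replace")
    return "".join(
        placeholder if ch == "\ufffd" or is_noncharacter(ord(ch)) else ch
        for ch in text
    )


if __name__ == "__main__":
    # Well-formed UTF-8 that encodes the noncharacter U+FFFE.
    print(sanitize(b"ab\xef\xbf\xbecd"))
    # A stray byte that is not valid UTF-8 at all.
    print(sanitize(b"a\xffb"))
```

Using a longer marker such as "[INVALID CODEPOINT REMOVED]" only requires passing a different placeholder argument.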

Best,
Stefan



