[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Wed Aug 6 23:28:10 CEST 2014

To be clear: if your program is not the one that *introduces* the invalid codepoints, then it's not the program that I was calling "rogue". The program that produced the original "&#65535;" (or whatever) found on the web is the rogue.

Some further design considerations: to enhance user-friendliness CWB v 4 may well have a command line flag allowing you to specify whether you would rather have invalid UTF-8 sequences cause an arbitrary number of demons to fly out of your nose (subject to the phase of the moon and the current exchange rate of the renminbi to the Austrian schilling) or be overwritten by "?".
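
The "overwritten by ?" option could work roughly as sketched below. This is a Python illustration only, not CWB code: the function names are hypothetical, Python's strict decoder stands in for the real validator, and noncharacters are treated as invalid on the assumption that this matches the behaviour reported in this thread. One offending byte at a time is overwritten with "?" and the string is then revalidated, so the byte length never changes.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters (U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def first_invalid_offset(data: bytes) -> int:
    """Return the byte offset of the first problem in data, or -1 if there is none.

    Assumption: noncharacters count as invalid, like ill-formed bytes.
    """
    try:
        text = data.decode("utf-8", errors="strict")
        bad = -1
    except UnicodeDecodeError as exc:
        # Everything before exc.start decoded cleanly; scan that prefix too.
        text = data[:exc.start].decode("utf-8", errors="strict")
        bad = exc.start
    offset = 0
    for ch in text:
        if is_noncharacter(ord(ch)):
            return offset
        offset += len(ch.encode("utf-8"))
    return bad


def cleanup(data: bytes) -> bytes:
    """Overwrite offending bytes with '?' one at a time, revalidating after each."""
    buf = bytearray(data)
    while True:
        pos = first_invalid_offset(bytes(buf))
        if pos < 0:
            return bytes(buf)
        # The offending byte is always >= 0x80, so every pass makes progress.
        buf[pos] = ord("?")


print(cleanup(b"a\xef\xbf\xbfb"))  # noncharacter U+FFFF between two letters
print(cleanup(b"caf\xe9"))         # a stray Latin-1 byte
```

A three-byte noncharacter comes out as three "?" under this scheme, which is the "more than one ? per bad character" trade-off. Revalidating from the start after every byte is quadratic in the worst case; a real implementation would presumably resume from the last known-good offset.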

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Roland Schäfer
Sent: 06 August 2014 19:03
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Hi,

thanks, Serge, for looking up the passage in the Unicode specification.
I didn't have time for that, and that is the only reason I didn't reply earlier.

I am the author of the (web corpus creation) software which was called "rogue" in Andrew's first post. First of all, I want to stress that my code does not introduce such codepoints but simply fails to filter them.
(In fact "failED to filter them", because I have just added an option to replace non-character codepoints.) There are web documents actually containing such characters, even in the quite absurd form of &#65535;. Have fun with, for example:

http://www.b-guard.nl/web/Nieuws/

Also see the message just posted by Stefan. (@Stefan: If the 9 billion words came from us, I can assure you that such problems should not occur with future releases.)

Furthermore, please grant other programmers the benefit of the doubt and consider that

iconv -f utf8 -t utf8 -c

(where -c is for "Omit invalid characters from output.") does NOT filter the sequences under discussion. Also, isutf8 does NOT raise an error for the file Enrique was working with. Finally, I internally use non-lenient conversion from any encoding to UTF16 and then convert from that to UTF8 using ICU, and even the very picky ICU library did not complain about the strings in question (obviously different from glib). In addition to this implicit check, I have my own UTF8 checking routines implemented according to the official specification, including the passage quoted by Serge.
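
The same leniency can be reproduced with another strict decoder. The sketch below uses Python's built-in UTF-8 codec, which is an assumption chosen for illustration and not one of the tools named above; the helper names are hypothetical. It decodes the six byte sequences reported in this thread in strict mode and contrasts them with two genuinely ill-formed inputs.

```python
# The six byte sequences reported in this thread, as hex strings.
REPORTED = ["ef b7 90", "ef b7 93", "ef b7 a1", "ef b7 af", "ef bf be", "ef bf bf"]


def decode_strict(hex_bytes: str) -> str:
    """Decode a hex-encoded byte sequence as UTF-8, raising on ill-formed input."""
    return bytes.fromhex(hex_bytes).decode("utf-8", errors="strict")


def is_well_formed(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts raw."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Every reported sequence decodes to a single (non)character without an error.
code_points = [f"U+{ord(decode_strict(h)):04X}" for h in REPORTED]
print(code_points)

# Truly broken input is rejected by the same decoder.
print(is_well_formed(b"\xef\xbf"))   # truncated three-byte sequence
print(is_well_formed(b"\xc0\x80"))   # overlong encoding of U+0000
```

So a decoder can be strict about byte-level well-formedness and still pass noncharacters through; whether to reject them is a separate, higher-level policy decision.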

So, non-character codepoints should be avoided but are not ill-formed according to that passage, and they are accepted as well-formed by two standard tools which perform UTF8 well-formedness checking as well as the major Unicode handling library. As opposed to single offending bytes or broken sequences, the UTF8 byte sequences under discussion are technically correct/parsable and thus can be replaced with the replacement character.
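
Because the sequences decode cleanly, replacing them is a per-code-point operation. A minimal sketch in Python follows; the names `is_noncharacter` and `scrub` are hypothetical and this is not the code of any tool mentioned in this thread.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def scrub(text: str) -> str:
    """Return text with every noncharacter replaced by U+FFFD."""
    return "".join(REPLACEMENT if is_noncharacter(ord(ch)) else ch for ch in text)


raw = bytes.fromhex("61 ef bf bf 62")            # "a", U+FFFF, "b"
clean = scrub(raw.decode("utf-8")).encode("utf-8")
print(clean.hex(" "))
```

U+FFFD is three bytes in UTF-8 (ef bf bd), so replacing a BMP noncharacter keeps the byte length unchanged, whereas a supplementary-plane noncharacter such as U+10FFFF (f4 8f bf bf) shrinks from four bytes to three.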

Please do not take this the wrong way: I do not expect CWB to handle the sequences in question (although - implementation details aside - this would be possible), and I agree that "applications should not attempt to exchange them" means that they should be fixed at the earliest stage, which is what I am doing now.

Best,
Roland

On 08/06/2014 06:19 PM, Hardie, Andrew wrote:
> Hi Serge,
> 
>>> As I understand, Enrique has his own point of view of "documents [that] have been carefully checked for utf8 wellformedness".
> 
> This is not an issue that admits of multiple points of view. Either it's wellformed or it's not. If it contains [ef bf bf], then it's not. 
> 
> The question that admits of multiple points of view is, rather, what is an appropriate way to deal with *any* invalid sequence (whether it be a nicely-encoded-but-illegal character, or a bunch of badly-encoded bytes) when encountered. See below.
> 
>>> Can you please be more explicit when you say "There is not any way to override cwb's insistence on UTF-8 data being well formed"?
> 
> I mean the *user* has no way to override this. All input strings are checked for UTF-8 validity and if they are not valid then there is an abort. 
> 
> The story behind this: we simply don't let users import invalid UTF-8 into an indexed corpus. If we did, we would not be able to assume wellformedness when querying/outputting, which would be utterly unworkable. The question, then, is whether cwb-encode should simply abort, or whether it should attempt to clean up the string into something that *can* be imported into the index.
> 
> The obvious answer is that this should be left to the user's preference. And, in line with this, with 8-bit charsets it *is* indeed possible to have cwb-encode insert a substitution character ("?") rather than abort upon encountering an invalid byte. The -C flag turns cleanup on; the default behaviour is abort. -C is especially useful for dealing with Windows-1252 incorrectly labelled as ISO 8859-1.
> 
> The problem: When I implemented this for ISO 8859, I found that an equivalent approach was not workable when dealing with UTF-8. (Assorted reasons, not all of which I recall as this was years ago, but two I remember: the validation function (Glib's g_utf8_validate) tells you where the invalid data *starts* but you then have to work out where it *ends*; plus you have to be careful about not changing the length of the string.) Thus for UTF-8, unlike other charsets, there is no cleanup option: only abort-on-bad-data is available.
> 
> However, looking back and mulling it over, I think I may now have thought of a way to get cleanup to work by incrementally overwriting invalid bytes with "?" and then revalidating. That would mean you'd get more than one "?" for a multi-byte bad character, but that is not necessarily a problem (it is invalid data, so how many characters it "really" represents is undefined). I will look into this when I get some time. Its feasibility depends on exactly how errors like [ef bf bf] get reported by g_utf8_validate.
> 
> Using U+FFFD instead of bytewise "?" would be very difficult without a 
> total rewrite of cwb-encode's input line handler (the string in 
> question can't change length because doing so would overwrite other 
> fields in the input line). Another consideration is compatibility with 
> the treatment of ISO-8859 input data. So I am going to stick with "?" 
> at least for the present. (Version 4 may be a different story.)
> 
> best
> 
> Andrew.
> 
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] 
> On Behalf Of Serge Heiden
> Sent: 06 August 2014 15:43
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] cwb-encode struggling with certain well-formed 
> noncharacter utf8 byte sequences
> 
> Hi Andrew,
> 
> Can you please be more explicit when you say "There is not any way to 
> override cwb's insistence on UTF-8 data being well formed"? (what 
> code/library does the work)
> 
> As I understand, Enrique has his own point of view of "documents [that] have been carefully checked for utf8 wellformedness".
> 
> As I read Section 16.7 of the Unicode standard:
> "Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text."
> 
> My suggestion:
> - replace such characters (including U+FFFF and U+10FFFF) by U+FFFD 
> and don't abort
> - maybe display a warning.
> 
> This should be considered Unicode compliant, because the standard says it is good practice, and it would allow Enrique to import his corpus as is at his own risk.
> 
> 
> Best,
> Serge
> 
> Le 06/08/2014 13:34, Hardie, Andrew a écrit :
>> Hi Enrique,
>>
>> The characters from U+fdd0 to U+fdef are reserved for "process-internal use": that is, whatever program is introducing these in its output is Doing Things Wrong and acting against an explicit mandate of the Unicode standard. So, the right thing here is either to remove those characters, or (better yet) fix whatever rogue program is introducing them.
>>
>> The last two characters on your list are U+FFFE and U+FFFF, which are guaranteed noncharacters (often handled as wrong-endian BOM and error code). For them to be appearing in your data files is very bad news indeed....
>>
>> There is not any way to override cwb's insistence on UTF-8 data being well formed (other than encoding the text as some other character set, which causes other problems) and it is not something you should want to do anyway.
>>
>> best
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it 
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Enrique Manjavacas
>> Sent: 06 August 2014 10:02
>> To: cwb at sslmit.unibo.it
>> Subject: [CWB] cwb-encode struggling with certain well-formed 
>> noncharacter utf8 byte sequences
>>
>> Hi,
>>
>> I have an issue when trying to encode a corpus with the utf8 charset option in version 3.4.7. The encoding process aborts with the message:
>>
>> Encoding error: an invalid byte or byte sequence for charset "utf8" was encountered.
>>
>> However, the documents have been carefully checked for utf8 wellformedness.
>> Inspecting the files has shown that the problematic codepoints are the noncharacters:
>>
>> ef b7 90
>> ef b7 93
>> ef b7 a1
>> ef b7 af
>> ef bf be
>> ef bf bf
>>
>> and filtering them before encoding resolves the issue, but still I was wondering whether there is some way of getting cwb-encode to accept such input.
>>
>> thanks!
>> Enrique
>>
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb