[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Aug 6 18:19:52 CEST 2014


Hi Serge,

>> As I understand, Enrique has his own point of view of "documents [that] have been carefully checked for utf8 wellformedness".

This is not an issue that admits of multiple points of view. Either it's wellformed or it's not. If it contains [ef bf bf], then it's not. 

The question that admits of multiple points of view is, rather, what is an appropriate way to deal with *any* invalid sequence (whether it be a nicely-encoded-but-illegal character, or a bunch of badly-encoded bytes) when encountered. See below.

>> Can you please be more explicit when you say "There is not any way to override cwb's insistence on UTF-8 data being well formed"?

I mean the *user* has no way to override this. All input strings are checked for UTF-8 validity and if they are not valid then there is an abort. 

The story behind this: we simply don't let users import invalid UTF-8 into an indexed corpus. If we did, we would not be able to assume wellformedness when querying/outputting, which would be utterly unworkable. The question, then, is whether cwb-encode should simply abort, or whether it should attempt to clean up the string into something that *can* be imported into the index.

The obvious answer is that this should be left to the user's preference. And, in line with this, with 8-bit charsets it *is* indeed possible to have cwb-encode insert a substitution character ("?") rather than abort upon encountering an invalid byte. The -C flag turns cleanup on; the default behaviour is to abort. -C is especially useful for dealing with Windows-1252 data incorrectly labelled as ISO 8859-1.
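
To illustrate the substitution approach for 8-bit data, here is a minimal Python sketch (it is not cwb-encode's actual code). It assumes, purely for illustration, that the bytes counted as invalid for ISO 8859-1 are 0x80 to 0x9F, the range where Windows-1252 places printable characters such as curly quotes. The function name is hypothetical.

```python
# Assumption: bytes 0x80-0x9F are the ones treated as invalid for ISO 8859-1.
INVALID_LATIN1 = bytes(range(0x80, 0xA0))

# Translation table: each assumed-invalid byte maps to "?", every other byte
# maps to itself, so the output always has the same length as the input.
_TABLE = bytes(ord("?") if b in INVALID_LATIN1 else b for b in range(256))


def substitute_invalid_latin1(data: bytes) -> bytes:
    """Return a copy of data with assumed-invalid bytes replaced by '?'."""
    return data.translate(_TABLE)


# Windows-1252 curly quotes (0x93, 0x94) in data labelled as ISO 8859-1.
print(substitute_invalid_latin1(b"\x93hi\x94"))
```

Because the replacement is one byte for one byte, the neighbouring fields of the input line are never shifted.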

The problem: When I implemented this for ISO 8859, I found that an equivalent approach was not workable when dealing with UTF-8. (Assorted reasons, not all of which I recall as this was years ago, but two I remember: the validation function (GLib's g_utf8_validate) tells you where the invalid data *starts* but you then have to work out where it *ends*; plus you have to be careful about not changing the length of the string.) Thus for UTF-8, unlike other charsets, there is no cleanup option: only abort-on-bad-data is available.

However, looking back and mulling it over, I think I may now have thought of a way to get cleanup to work by incrementally overwriting invalid bytes with "?" and then revalidating. That would mean you'd get more than one "?" for a multi-byte bad character, but that is not necessarily a problem (it is invalid data, so how many characters it "really" represents is undefined). I will look into this when I get some time. Its feasibility depends on exactly how errors like [ef bf bf] get reported by g_utf8_validate.
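
The overwrite-and-revalidate idea can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the cwb-encode implementation: first_invalid_offset is a hypothetical stand-in for a validator that reports where invalid data starts, and it is assumed here that noncharacters such as [ef bf bf] count as invalid, as reported in this thread.

```python
from typing import Optional


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Hypothetical validator: byte offset of the first problem, or None."""
    try:
        text = data.decode("utf-8")
        bad = None
    except UnicodeDecodeError as err:
        # Everything before err.start decodes cleanly.
        text = data[:err.start].decode("utf-8")
        bad = err.start
    offset = 0
    for ch in text:
        if is_noncharacter(ord(ch)):  # assumption: treated as invalid
            return offset
        offset += len(ch.encode("utf-8"))
    return bad


def cleanup(data: bytes) -> bytes:
    """Overwrite one invalid byte at a time with '?', then revalidate."""
    buf = bytearray(data)
    while (pos := first_invalid_offset(bytes(buf))) is not None:
        buf[pos] = ord("?")  # one byte for one byte: length is unchanged
    return bytes(buf)


print(cleanup(b"a\xef\xbf\xbfb"))  # prints b'a???b'
```

Each pass turns one bad byte into a valid ASCII byte, so the loop terminates, and a three-byte bad sequence yields three "?" characters. Revalidating the whole string on every pass is quadratic in the worst case; a real implementation would presumably resume from the last reported offset.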

Using U+FFFD instead of bytewise "?" would be very difficult without a total rewrite of cwb-encode's input line handler (the string in question can't change length because doing so would overwrite other fields in the input line). Another consideration is compatibility with the treatment of ISO-8859 input data. So I am going to stick with "?" at least for the present. (Version 4 may be a different story.)
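
The length problem is easy to see in a small Python example (an illustration only; the sample field is made up). U+FFFD encodes as three bytes in UTF-8 [ef bf bd], whereas "?" is a single byte, so replacing a one-byte error with U+FFFD makes the string grow.

```python
# A made-up tab-separated input line; 0xE9 is Latin-1 "é", which is not a
# valid UTF-8 sequence on its own.
line = b"caf\xe9\tNN"

# Bytewise substitution: the bad byte becomes "?", length unchanged.
bytewise = line.replace(b"\xe9", b"?")

# U+FFFD substitution, using Python's built-in "replace" error handler.
with_fffd = line.decode("utf-8", errors="replace").encode("utf-8")

print(len(line), len(bytewise), len(with_fffd))  # prints: 7 7 9
```

With in-place editing of a fixed buffer, those two extra bytes would have to land on top of whatever follows the field.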

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Serge Heiden
Sent: 06 August 2014 15:43
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Hi Andrew,

Can you please be more explicit when you say "There is not any way to override cwb's insistence on UTF-8 data being well formed"? (what code/library does the work)

As I understand, Enrique has his own point of view of "documents [that] have been carefully checked for utf8 wellformedness".

As I read Section 16.7 of the Unicode standard:
"Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text."

My suggestion:
- replace such characters (including U+FFFF and U+10FFFF) by U+FFFD and don't abort
- maybe display a warning.

This should be considered Unicode compliant, because the standard says it is good practice, and it would allow Enrique to import his corpus as is, at his own risk.


Best,
Serge

On 06/08/2014 13:34, Hardie, Andrew wrote:
> Hi Enrique,
>
> The characters from U+fdd0 to U+fdef are reserved for "process-internal use": that is, whatever program is introducing these in its output is Doing Things Wrong and acting against an explicit mandate of the Unicode standard. So, the right thing here is either to remove those characters, or (better yet) fix whatever rogue program is introducing them.
>
> The last two characters on your list are U+FFFE and U+FFFF, which are guaranteed noncharacters (often handled as wrong-endian BOM and error code). For them to be appearing in your data files is very bad news indeed....
>
> There is not any way to override cwb's insistence on UTF-8 data being well formed (other than encoding the text as some other character set, which causes other problems) and it is not something you should want to do anyway.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Enrique Manjavacas
> Sent: 06 August 2014 10:02
> To: cwb at sslmit.unibo.it
> Subject: [CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences
>
> Hi,
>
> I have an issue when trying to encode a corpus with the utf8 charset option in version 3.4.7. The encoding process aborts with the message:
>
> Encoding error: an invalid byte or byte sequence for charset "utf8" was encountered.
>
> However, the documents have been carefully checked for utf8 wellformedness.
> Inspecting the files has shown that the problematic codepoints are the noncharacters:
>
> ef b7 90
> ef b7 93
> ef b7 a1
> ef b7 af
> ef bf be
> ef bf bf
>
> and filtering them before encoding resolves the issue, but still I was wondering whether there is some way of getting cwb-encode to accept such input.
>
> thanks!
> Enrique
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
