[CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences

Wed Aug 6 16:43:22 CEST 2014

Hi Andrew,

Could you be more explicit about what you mean by "There is not any way
to override cwb's insistence on UTF-8 data being well formed"? In
particular, which code or library performs the check?

As I understand it, Enrique has his own view of what counts as
"documents [that] have been carefully checked for utf8 wellformedness".

As I read Section 16.7 of the Unicode standard:
"Applications are free to use any of these noncharacter code points
internally but should never attempt to exchange them. If a noncharacter
is received in open interchange, an application is not required to
interpret it in any way. It is good practice, however, to recognize it
as a noncharacter and to take appropriate action, such as replacing it
with U+FFFD replacement character, to indicate the problem in the text."

My suggestion:
- replace such characters (including U+FFFF and U+10FFFF) with U+FFFD
  instead of aborting
- maybe display a warning.

This should count as Unicode compliant, because the standard describes
it as good practice, and it would allow Enrique to import his corpus as
is, at his own risk.
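
To make the suggestion concrete, here is a minimal sketch of the
replacement step as a pre-processing filter in Python. It is only an
illustration and says nothing about how cwb-encode is implemented; the
function names are hypothetical. It assumes the input is otherwise
well-formed UTF-8 and treats as noncharacters U+FDD0..U+FDEF plus the
last two code points of each of the 17 planes (U+FFFE, U+FFFF, ...,
U+10FFFE, U+10FFFF).

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp):
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, or a code point ending in FFFE/FFFF in any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_noncharacters(text):
    """Replace every noncharacter with U+FFFD; return (new_text, count)."""
    out = []
    count = 0
    for ch in text:
        if is_noncharacter(ord(ch)):
            out.append(REPLACEMENT)
            count += 1
        else:
            out.append(ch)
    return "".join(out), count


def filter_stream(inp, out, err):
    """Hypothetical filter: binary stream in, binary stream out, warnings to err.

    Decoding is strict, so ill-formed UTF-8 still raises UnicodeDecodeError;
    only well-formed noncharacters are replaced.
    """
    for lineno, raw in enumerate(inp, start=1):
        cleaned, n = replace_noncharacters(raw.decode("utf-8"))
        if n:
            err.write("warning: line %d: replaced %d noncharacter(s)\n" % (lineno, n))
        out.write(cleaned.encode("utf-8"))
```

Run over a corpus file, for example with
filter_stream(sys.stdin.buffer, sys.stdout.buffer, sys.stderr), this
would turn each of the sequences Enrique lists into ef bf bd and print
one warning per affected line, so the result could be passed on to
cwb-encode.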

Best,
Serge

On 06/08/2014 13:34, Hardie, Andrew wrote:
> Hi Enrique,
>
> The characters from U+fdd0 to U+fdef are reserved for "process-internal use": that is, whatever program is introducing these in its output is Doing Things Wrong and acting against an explicit mandate of the Unicode standard. So, the right thing here is either to remove those characters, or (better yet) fix whatever rogue program is introducing them.
>
> The last two characters on your list are U+FFFE and U+FFFF, which are guaranteed noncharacters (often handled as wrong-endian BOM and error code). For them to be appearing in your data files is very bad news indeed....
>
> There is not any way to override cwb's insistence on UTF-8 data being well formed (other than encoding the text as some other character set, which causes other problems) and it is not something you should want to do anyway.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Enrique Manjavacas
> Sent: 06 August 2014 10:02
> To: cwb at sslmit.unibo.it
> Subject: [CWB] cwb-encode struggling with certain well-formed noncharacter utf8 byte sequences
>
> Hi,
>
> I have an issue when trying to encode a corpus with the utf8 charset option in version 3.4.7. The encoding process aborts with the message:
>
> Encoding error: an invalid byte or byte sequence for charset "utf8" was encountered.
>
> However, the documents have been carefully checked for utf8 wellformedness.
> Inspecting the files has shown that the problematic codepoints are the noncharacters:
>
> ef b7 90
> ef b7 93
> ef b7 a1
> ef b7 af
> ef bf be
> ef bf bf
>
> and filtering them before encoding resolves the issue, but still I was wondering whether there is some way of getting cwb-encode to accept such input.
>
> thanks!
> Enrique
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883