[CWB] unicode problems with Greek and OCS

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Mar 10 18:34:38 CET 2015


cl_string_canonical() reports an invalid UTF-8 string if g_utf8_normalize() returns NULL; the GLib documentation (https://developer.gnome.org/glib/2.42/glib-Unicode-Manipulation.html#g-utf8-normalize) says this only happens if the string is invalid UTF-8.

I checked every string in the lexicon that Ruprecht sent me for UTF-8 validity; no errors, as expected.

I have taken a look at the relevant GLib code. It is not easy to see what is going on, but I THINK that g_utf8_normalize() does not actually validate the string. Instead, it depends on an error value passed back by a different function, g_utf8_get_char(), which returns a UCS-4 code point. The error output of this function is documented as "undefined", but the code shows it is actually (uint32)-1, i.e. 0xFFFFFFFF. This is then re-encoded into the UTF-8 output using g_ucs4_to_utf8(), which aborts with an "out of range" error since this is not a valid Unicode code point. That means that any bad input is automatically caught on the output.
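To make the mechanism above concrete, here is a minimal sketch in plain C. It does not use GLib and is not GLib's code: DECODE_ERROR and encode_utf8() are hypothetical names, assumed only for illustration. It shows that (uint32_t)-1 is 0xFFFFFFFF, which lies far above the highest Unicode code point (0x10FFFF), so any encoder with a range check will refuse it.

```c
#include <stddef.h>
#include <stdint.h>

/* Highest code point defined by Unicode. */
#define MAX_CODE_POINT 0x10FFFFu

/* Hypothetical stand-in for the decoder's error value:
 * (uint32_t)-1 wraps around to 0xFFFFFFFF (32 bits, not 64). */
static const uint32_t DECODE_ERROR = (uint32_t)-1;

/* Encode one code point as UTF-8 into out[].
 * Returns the number of bytes written, or 0 if the value is out of
 * range or a surrogate. Illustrative only, not GLib's implementation. */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp > MAX_CODE_POINT || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Passing DECODE_ERROR to encode_utf8() yields 0, whereas an ordinary Greek letter such as U+03B1 encodes to two bytes. Whether GLib behaves this way for every malformed input is exactly the open question here.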

By contrast, g_utf8_validate() actually runs on the UTF-8 bytes directly, without recoding to UCS-4.
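For comparison, a byte-level check of this kind can be sketched as follows. This is an assumption-laden illustration in plain C, not GLib's actual g_utf8_validate() code; utf8_is_valid() is a hypothetical name. It walks the byte sequence and rejects stray or missing continuation bytes, overlong forms, surrogates and values above U+10FFFF, without ever building a UCS-4 value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if s[0..len-1] is well-formed UTF-8.
 * Works purely on bytes; illustrative only, not GLib's implementation. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;                           /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* range for the first of them */

        if (b < 0x80) {                     /* ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            n = 1;                          /* 0xC0, 0xC1 would be overlong */
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;       /* exclude overlong forms */
            if (b == 0xED) hi = 0x9F;       /* exclude surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 3;
            if (b == 0xF0) lo = 0x90;       /* exclude overlong forms */
            if (b == 0xF4) hi = 0x8F;       /* exclude values > U+10FFFF */
        } else {
            return false;                   /* stray continuation or bad lead */
        }

        if (len - i <= n)                   /* sequence is truncated */
            return false;
        if (s[i + 1] < lo || s[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= n; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += n + 1;
    }
    return true;
}
```

Note that a check like this accepts private-use and unassigned code points (for example U+E000, bytes EE 80 80), since it only tests well-formedness.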

I am continuing to ponder...

A

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
Sent: 10 March 2015 16:26
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] unicode problems with Greek and OCS

I can't be absolutely sure, but among the word forms that cropped up in the error messages, there were definitely some that did NOT include anything from the private use area. So no, I'd think this is not the case.

I THOUGHT it might be precomposed Greek characters, but I ran the different normalizations and that didn't change anything.

Ruprecht

Am 10.03.2015 um 16:56 schrieb Stefan Evert:
>> Again -- it should be emphasised that this is a basic, fall-back aligner for when you have nothing better. It is not going to be terribly effective for Unicode data in different alphabets.
> Precisely.  It would be nice to improve cwb-align so that it works with Unicode character n-grams, but before we add any major features, we'd have to rewrite the overoptimized programme code from scratch.  So it's not likely to happen anytime soon.
>
> BTW, if anyone feels like contributing to CWB, a state-of-the-art sentence aligner as a replacement for cwb-align would be a nice thing to have!
>
> However, despite all the shortcomings of cwb-align, you shouldn't get these error messages if both corpora contain valid UTF-8 strings.  They mean that the GLib function g_utf8_normalize() has rejected a string as invalid UTF-8.  If a recent version of cwb-encode accepted the data as UTF-8, then g_utf8_validate() and g_utf8_normalize() differ in their definition of valid UTF-8.
>
> One possibility would be that g_utf8_validate() only checks that the byte sequence can be decoded, while g_utf8_normalize() barfs on unassigned code points or perhaps private use areas that it cannot deal with.  Ruprecht, any chance that the corpora contain weird code points that are not assigned in the Unicode standard (or have been assigned only very recently)?
>
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
