[CWB] unicode problems with Greek and OCS

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Tue Mar 10 17:26:01 CET 2015


I can't be absolutely sure, but among the word forms that cropped up in 
the error messages, there were definitely some that did NOT include 
anything from the private use area. So no, I don't think that is the cause.
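
For anyone who wants to check their own word forms, here is a minimal 
sketch of such a test in Python. The function name and the choice of 
categories are my own assumptions, not anything taken from CWB or Glib. 
It relies on the Unicode tables bundled with the Python interpreter, 
which may be newer or older than the tables Glib uses, so a clean 
result here does not prove that g_utf8_normalize() will accept the 
string.

```python
import unicodedata

# General categories that mark a character as private use, unassigned
# (including noncharacters), or a surrogate.
SUSPECT_CATEGORIES = {"Co": "private use", "Cn": "unassigned", "Cs": "surrogate"}


def suspicious_code_points(word):
    """Return (index, 'U+XXXX', label) for each suspect character in word."""
    hits = []
    for i, ch in enumerate(word):
        label = SUSPECT_CATEGORIES.get(unicodedata.category(ch))
        if label:
            hits.append((i, f"U+{ord(ch):04X}", label))
    return hits
```

Running this over every word form quoted in the error messages would 
show whether any of them contain such code points; an empty list means 
none were found according to Python's tables.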

I THOUGHT it might be precomposed Greek characters, but I ran the 
different normalizations and that didn't change anything.
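
To illustrate what running the normalizations on a word form can look 
like, here is a small Python sketch (not the commands actually used on 
the corpora; the function name is hypothetical). It reports which of 
the four standard normalization forms would alter a given string. 
Python's unicodedata module stands in for Glib here, so the results 
need not match g_utf8_normalize() in every case.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_report(word):
    """Map each normalization form to True if applying it changes word."""
    return {form: unicodedata.normalize(form, word) != word for form in FORMS}
```

For example, the precomposed letter U+03AC (alpha with tonos) is left 
alone by NFC and NFKC but decomposed by NFD and NFKD, while U+1F71 
(alpha with oxia) is changed by all four forms because it canonically 
maps to U+03AC.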

Ruprecht

On 10.03.2015 at 16:56, Stefan Evert wrote:
>> Again -- it should be emphasised that this is a basic, fall-back aligner for when you have nothing better. It is not going to be terribly effective for Unicode data in different alphabets.
> Precisely.  It would be nice to improve cwb-align so that it works with Unicode character n-grams, but before we add any major features, we'd have to rewrite the overoptimized programme code from scratch.  So it's not likely to happen anytime soon.
>
> BTW, if anyone feels like contributing to CWB, a state-of-the-art sentence aligner as a replacement for cwb-align would be a nice thing to have!
>
> However, despite all the shortcomings of cwb-align, you shouldn't get these error messages if both corpora contain valid UTF-8 strings.  They mean that the Glib function g_utf8_normalize() has rejected a string as invalid UTF-8.  If a recent version of cwb-encode accepted the data as UTF-8, then g_utf8_validate() and g_utf8_normalize() differ in their definition of valid UTF-8.
>
> One possibility would be that g_utf8_validate() only checks that the byte sequence can be decoded, while g_utf8_normalize() barfs on unassigned code points or perhaps private use areas that it cannot deal with.  Ruprecht, any chance that the corpora contain weird code points that are not assigned in the Unicode standard (or have been assigned only very recently)?
>
> Stefan


