[CWB] unicode problems with Greek and OCS

Stefan Evert stefanML at collocations.de
Tue Mar 10 16:56:02 CET 2015


> Again -- it should be emphasised that this is a basic, fall-back aligner for when you have nothing better. It is not going to be terribly effective for Unicode data in different alphabets.

Precisely.  It would be nice to improve cwb-align so that it works with Unicode character n-grams, but before we add any major features, we'd have to rewrite the over-optimized program code from scratch.  So it's not likely to happen anytime soon.

BTW, if anyone feels like contributing to CWB, a state-of-the-art sentence aligner as a replacement for cwb-align would be a nice thing to have!

However, despite all the shortcomings of cwb-align, you shouldn't get these error messages if both corpora contain valid UTF-8 strings.  They mean that the GLib function g_utf8_normalize() has rejected a string as invalid UTF-8.  If a recent version of cwb-encode accepted the data as UTF-8, then g_utf8_validate() and g_utf8_normalize() must differ in their definition of valid UTF-8.

One possibility would be that g_utf8_validate() only checks that the byte sequence can be decoded, while g_utf8_normalize() barfs on unassigned code points or perhaps private use areas that it cannot deal with.  Ruprecht, any chance that the corpora contain weird code points that are not assigned in the Unicode standard (or have been assigned only very recently)?
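
A quick way to test this hypothesis is to scan the input text for such code points before encoding.  The following is only a rough Python sketch, not part of CWB: the function names are made up for illustration, and it assumes that unassigned (Cn) and private-use (Co) characters are the ones to look for.  Note that Python's unicodedata module reflects the Unicode version bundled with the interpreter, which may well differ from the tables in the installed GLib, so a clean result here does not prove that GLib will accept the data.

```python
import unicodedata

# Assumption: these general categories are the likely troublemakers.
# Cn = unassigned (includes noncharacters), Co = private use.
SUSPECT_CATEGORIES = {"Cn", "Co"}


def find_suspect_code_points(text):
    """Return (offset, 'U+XXXX', category) for each suspect character."""
    hits = []
    for offset, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES:
            hits.append((offset, f"U+{ord(ch):04X}", category))
    return hits


def scan_file(path):
    """Print suspect code points line by line.

    Lines that Python's strict decoder rejects are reported as
    invalid UTF-8 and skipped.
    """
    with open(path, "rb") as handle:
        for line_no, raw in enumerate(handle, start=1):
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                print(f"{path}:{line_no}: invalid UTF-8 ({err.reason})")
                continue
            for offset, code_point, category in find_suspect_code_points(text):
                print(f"{path}:{line_no}:{offset}: {code_point} ({category})")
```

Calling scan_file() on each of the two input files (whatever they are called on your system) lists the line number, character offset and code point of every hit, which should make it easy to see whether the strings rejected by cwb-align contain anything unusual.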

Stefan


