[CWB] unicode problems with Greek and OCS

Stefan Evert stefanML at collocations.de
Tue Mar 10 14:09:17 CET 2015


One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding.  cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
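
To illustrate the kind of mismatch I mean, here is a minimal Python sketch (not CWB code; the helper name and the choice of ISO-8859-7 as the "other" encoding are my own assumptions for the example). It shows that a Greek string stored in a legacy 8-bit encoding does not survive being treated as UTF-8:

```python
def decodes_as_utf8(raw):
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "λογος"

# Hypothetical source corpus: the word stored as UTF-8 (2 bytes per letter).
utf8_bytes = word.encode("utf-8")

# Hypothetical target corpus: the same word stored as ISO-8859-7 (1 byte per letter).
greek_bytes = word.encode("iso8859_7")

# Treating both byte strings as UTF-8, as a tool would if it only
# looked at the source corpus's declared encoding.
source_ok = decodes_as_utf8(utf8_bytes)
target_ok = decodes_as_utf8(greek_bytes)

print(source_ok, target_ok)
```

The target-side bytes are rejected because they are not valid UTF-8, even though they are perfectly good text in their own encoding.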

At first I thought this might be because the character n-gram features are in fact n-grams of bytes (so they can cut multi-byte characters apart and produce invalid UTF-8 sequences), but that cannot be the cause: only the full strings are passed to cl_string_canonical().  See lines 287, 296, 509 and 527 in utils/feature_maps.c.
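
For anyone wondering why byte n-grams would be suspicious in the first place, here is a small Python sketch (again not the CWB implementation; the function names are made up for illustration). It takes byte n-grams of a UTF-8 encoded Greek word and checks which of them are still well-formed UTF-8 on their own:

```python
def byte_ngrams(text, n):
    """Return all n-grams of bytes from the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def is_valid_utf8(chunk):
    """Return True if the byte chunk decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Every letter of this word takes two bytes in UTF-8.
word = "λογος"

bigrams = byte_ngrams(word, 2)
trigrams = byte_ngrams(word, 3)

# Bigrams are only valid when they happen to start on a character boundary.
valid_bigrams = [g for g in bigrams if is_valid_utf8(g)]

# A 3-byte window over 2-byte characters always splits a character.
valid_trigrams = [g for g in trigrams if is_valid_utf8(g)]

print(len(bigrams), len(valid_bigrams))
print(len(trigrams), len(valid_trigrams))
```

So byte n-grams would only cause trouble if they were handed to a function that expects complete UTF-8 strings, which is exactly what does not happen here.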

Cheers,
Stefan

