[CWB] unicode problems with Greek and OCS

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Tue Mar 10 15:42:41 CET 2015


All corpora are encoded as UTF-8. This looks really strange. I tried 
different Unicode normalizations, namely NFKD, NFC and NFD, but all to 
no avail.
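
For reference, here is a minimal Python sketch of what these normalization forms do to a single Greek letter. The sample string and the helper name `describe` are illustrative assumptions and are not taken from the corpora in question. The sketch only shows that the forms produce different code point and byte sequences for the same visible character.

```python
import unicodedata

# Greek alpha with tonos in decomposed form:
# U+03B1 (alpha) followed by U+0301 (combining acute accent).
decomposed = "\u03b1\u0301"


def describe(form, text):
    """Return the code points and the UTF-8 byte length after normalization."""
    normalized = unicodedata.normalize(form, text)
    code_points = [f"U+{ord(ch):04X}" for ch in normalized]
    return code_points, len(normalized.encode("utf-8"))


for form in ("NFC", "NFD", "NFKD"):
    code_points, n_bytes = describe(form, decomposed)
    print(form, code_points, n_bytes)
```

NFC composes the pair into one code point, while NFD and NFKD keep two, so a byte-level comparison only lines up if both sides use the same form.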

What are the ngrams for? There is no word alignment, so it's all about 
the alignment anchors - shouldn't they be independent of the character set?
Best,
Ruprecht


On 10.03.2015 at 14:09, Stefan Evert wrote:
> One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding.  cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
>
> At first I thought that this might be due to the fact that the character n-gram features are in fact n-grams of bytes (so they cut out invalid UTF-8 sequences), but only the full strings are passed to cl_string_canonical().  See lines 287, 296, 509 and 527 in utils/feature_maps.c.
>
> Cheers,
> Stefan
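
To illustrate the point about byte n-grams in the quoted message, here is a rough Python sketch. It is not the code from utils/feature_maps.c. The helper names `byte_ngrams` and `is_valid_utf8` and the sample word are assumptions made for illustration. It shows that fixed-width windows over the UTF-8 bytes of a Greek word can cut through multi-byte characters.

```python
def byte_ngrams(text, n):
    """Slide a window of n bytes over the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def is_valid_utf8(chunk):
    """Check whether a byte string decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "logos" in Greek, written with escapes; every letter takes two bytes in UTF-8.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

grams = byte_ngrams(word, 3)
invalid = [g for g in grams if not is_valid_utf8(g)]
print(len(grams), "byte 3-grams,", len(invalid), "not valid UTF-8 on their own")
```

Because every character in this word occupies two bytes, each 3-byte window starts or ends in the middle of a character. Any string function that expects well-formed UTF-8 would therefore have to be applied to the full string rather than to such fragments.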


