[CWB] unicode problems with Greek and OCS
Ruprecht von Waldenfels
ruprecht.waldenfels at gmx.net
Tue Mar 10 15:42:41 CET 2015
All corpora are encoded as UTF-8. This looks really strange. I tried
different Unicode normalizations, namely NFKD, NFC and NFD, but all to
no avail.
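
For reference, a minimal sketch of what the normalization forms do to a Greek character, using Python's standard unicodedata module. The sample character and the helper name codepoints() are illustrative assumptions, not taken from the corpora. Every form yields valid UTF-8 once encoded, so switching forms changes the byte sequence but cannot repair a corpus whose declared encoding is not UTF-8.

```python
import unicodedata

def codepoints(form, text):
    """Return the code points of `text` after normalizing it to `form`."""
    normalized = unicodedata.normalize(form, text)
    return [f"U+{ord(ch):04X}" for ch in normalized]

# Illustrative sample: GREEK SMALL LETTER ALPHA WITH TONOS, precomposed.
sample = "\u03ac"

if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, sample)
        # Show the code points and the UTF-8 bytes each form produces.
        print(form, codepoints(form, sample), normalized.encode("utf-8").hex())
```

The composed forms keep a single code point, while the decomposed forms split it into a base letter plus a combining accent.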
What are the n-grams for? There is no word alignment, so it's all about
the alignment anchors; shouldn't they be independent of the character set?
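
To make the byte n-gram point from the quoted message below concrete, here is a small sketch, assuming only that n-grams are taken over the raw UTF-8 bytes of a token. The functions byte_ngrams() and is_valid_utf8() and the sample word are hypothetical illustrations and are not part of cwb-align. It shows that byte trigrams over two-byte Greek characters never line up with character boundaries.

```python
def byte_ngrams(text, n):
    """Return all n-grams over the UTF-8 bytes of `text` (not its characters)."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def is_valid_utf8(chunk):
    """Return True if `chunk` decodes as a complete UTF-8 sequence."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Illustrative sample: five Greek letters, each two bytes long in UTF-8.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

if __name__ == "__main__":
    for gram in byte_ngrams(word, 3):
        # Each trigram covers one and a half characters, so none decodes cleanly.
        print(gram.hex(), is_valid_utf8(gram))
```

Such fragments are harmless as long as they are only compared byte for byte; they become a problem only if a fragment is passed to a routine that expects well-formed UTF-8.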
Best,
Ruprecht
On 10.03.2015 at 14:09, Stefan Evert wrote:
> One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding. cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
>
> At first I thought that this might be due to the fact that the character n-gram features are in fact n-grams of bytes (so they cut out invalid UTF-8 sequences), but only the full strings are passed to cl_string_canonical(). See lines 287, 296, 509 and 527 in utils/feature_maps.c.
>
> Cheers,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb