[CWB] unicode problems with Greek and OCS

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Mar 10 15:46:48 CET 2015


The n-grams are for spotting corresponding words. As explained in the manfile, the program is designed for pairs like French-German, where the alphabet is the same and there is at least a smattering of cognate words that will be similar if not identical.

For Cyrillic vs Greek the n-grams buy you nothing: the two scripts share no letters, so cognates never produce matching n-grams.
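
To make that concrete, here is a minimal Python sketch of the idea, not the actual cwb-align feature code. The function names (char_ngrams, ngram_overlap) and the choice of a Dice score over character bigrams are assumptions made purely for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Dice coefficient over the two n-gram sets; 0.0 if either set is empty."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Same alphabet, cognate pair (French vs German): several shared bigrams.
latin_score = ngram_overlap("musique", "musik")

# Cyrillic vs Greek cognates: the code points never coincide,
# so the n-gram sets are disjoint and the score is zero.
mixed_score = ngram_overlap("философия", "φιλοσοφία")

print(latin_score, mixed_score)
```

Even letters that look alike, such as Cyrillic "о" and Greek "ο", are distinct code points, so they do not count as a match.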

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
Sent: 10 March 2015 14:43
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] unicode problems with Greek and OCS

All corpora are encoded as UTF-8. This looks really strange. I tried different Unicode normalizations, namely NFKD, NFC and NFD, but all to no avail.

What are the n-grams for? There is no word alignment, so it's all about the alignment anchors - shouldn't they be independent of the character set?
Best,
Ruprecht


Am 10.03.2015 um 14:09 schrieb Stefan Evert:
> One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding.  cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
>
> At first I thought that this might be due to the fact that the character n-gram features are in fact n-grams of bytes (so they cut out invalid UTF-8 sequences), but only the full strings are passed to cl_string_canonical().  See lines 287, 296, 509 and 527 in utils/feature_maps.c.
>
> Cheers,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
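
Stefan's remark that n-grams of bytes can cut UTF-8 sequences apart can be illustrated with a small Python sketch. This is not the code in utils/feature_maps.c; the helper names (byte_ngrams, is_valid_utf8) are hypothetical. As his message notes, only the full strings are passed to cl_string_canonical(), so this shows the effect he first suspected rather than a confirmed cause.

```python
def byte_ngrams(text, n=2):
    """Return all n-grams over the UTF-8 bytes of text (not over characters)."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def is_valid_utf8(chunk):
    """True if the byte chunk decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "слово"  # five Cyrillic letters, two bytes each in UTF-8
for n in (2, 3):
    grams = byte_ngrams(word, n)
    valid = sum(is_valid_utf8(g) for g in grams)
    print(f"n={n}: {len(grams)} byte n-grams, {valid} valid as UTF-8")
```

With two-byte letters, every byte n-gram that starts or ends in the middle of a letter is an invalid UTF-8 sequence when taken in isolation.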
