[CWB] unicode problems with Greek and OCS

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Tue Mar 10 15:50:57 CET 2015


So is the mistake non-fatal, or could one make it non-fatal? As you 
pointed out, this is a hopeless task.
Ruprecht
Am 10.03.2015 um 15:46 schrieb Hardie, Andrew:
> The n-grams are for spotting corresponding words. As explained in the man file, the program is designed for pairs like French-German, where the alphabet is the same and there is at least a smattering of cognate words that will be similar if not identical.
>
> For Cyrillic vs Greek the n-grams buy you nothing.
>
> Andrew.
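
To illustrate the point about shared alphabets: the following is a minimal sketch of character n-gram overlap between two words, not cwb-align's actual feature code. The helper names `char_ngrams` and `dice`, the choice of bigrams, and the Dice coefficient are all assumptions made for the example. It shows that a French-German cognate pair shares several bigrams, while a Cyrillic-Greek pair shares none, because the two scripts have no code points in common.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over the character n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Same alphabet, cognate words: the bigrams "mu", "us", "si" are shared.
same_script = dice("musique", "musik")

# Cyrillic vs Greek: no bigram can match, whatever the words mean.
cross_script = dice("слово", "λόγος")

print(same_script, cross_script)
```

With such a measure, any Cyrillic-Greek word pair scores zero, so this kind of feature cannot help to link the two sides.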
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
> Sent: 10 March 2015 14:43
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] unicode problems with Greek and OCS
>
> All corpora are encoded as UTF-8. This looks really strange. I tried different Unicode normalization forms, namely NFKD, NFC and NFD, but all to no avail.
>
> What are the n-grams for? There is no word alignment, so it's all about the alignment anchors - shouldn't they be independent of the character set?
> Best,
> Ruprecht
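
As a side note on the normalization forms mentioned above, here is a small sketch using Python's standard `unicodedata` module; it is not part of CWB, and the sample word and the helper `scripts` are assumptions chosen for illustration. It shows that NFC, NFD and NFKD change how an accented letter is composed, but never move a character from one script to another.

```python
import unicodedata

# Greek "λόγος", written with the precomposed letter ό (U+03CC).
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

forms = {name: unicodedata.normalize(name, word) for name in ("NFC", "NFD", "NFKD")}
lengths = {name: len(text) for name, text in forms.items()}


def scripts(text):
    """Collect the first word of each character's Unicode name (a rough script label)."""
    return {unicodedata.name(ch).split()[0] for ch in text}


for name, text in forms.items():
    print(name, lengths[name], sorted(scripts(text)))
```

The decomposed forms are one code point longer because ό becomes ο plus a combining accent, but every letter remains Greek, so normalization alone cannot make Greek and Cyrillic strings comparable.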
>
>
> Am 10.03.2015 um 14:09 schrieb Stefan Evert:
>> One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding.  cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
>>
>> At first I thought that this might be due to the fact that the character n-gram features are in fact n-grams of bytes (so they cut out invalid UTF-8 sequences), but only the full strings are passed to cl_string_canonical().  See lines 287, 296, 509 and 527 in utils/feature_maps.c.
>>
>> Cheers,
>> Stefan
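
The concern about byte n-grams that Stefan raises and then sets aside can be reproduced in a few lines. This is only a sketch in Python, not the code in utils/feature_maps.c; the helpers `byte_ngrams` and `is_valid_utf8` are hypothetical names introduced for the example. It shows that n-grams taken over the bytes of a UTF-8 string can split multi-byte characters and so are not valid UTF-8 on their own.

```python
def byte_ngrams(text, n=3):
    """Return all n-grams over the UTF-8 bytes of a string."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def is_valid_utf8(chunk):
    """Check whether a byte string decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Cyrillic "слово": every letter takes two bytes in UTF-8.
cyrillic = "\u0441\u043b\u043e\u0432\u043e"
cyrillic_grams = byte_ngrams(cyrillic)
invalid = [g for g in cyrillic_grams if not is_valid_utf8(g)]

# ASCII text has one byte per letter, so its byte n-grams stay valid.
ascii_grams = byte_ngrams("word")

print(len(cyrillic_grams), len(invalid), all(is_valid_utf8(g) for g in ascii_grams))
```

Every byte trigram of a string made only of two-byte characters either starts in the middle of a character or ends on a dangling lead byte, which is why passing such fragments to a UTF-8-aware function would fail, whereas passing the full string would not.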
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


