[CWB] unicode problems with Greek and OCS

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Mar 10 15:56:08 CET 2015


Well, you could always tell the aligner not to use n-grams as a feature!

From the manfile:

       -1:<weight>
           Specifies that the appearance of shared one-letter sequences within words in the two possibly-equivalent regions should be used as features for the similarity measurement, with the specified weight.

           The configuration flags "-1, -2, -3, -4" all specify the use of letter sequences as features, and they all work in the same way; the following general comments apply to all four of these flags.

           Sub-word letter-sequence matching allows the presence of similar but not identical words to count as a factor in similarity. Such words are often orthographic cognates that are likely to be translation equivalents and
           thus evidence that the pair of regions under analysis really are equivalent.  The longer the letter sequence, the more impressive the evidence (so you would normally weight "-4" more heavily than "-3", and so on; the
           default configuration (see below) does not include "-1" and "-2" at all).

           When letter sequences are compared, the comparison is case-insensitive and diacritic-insensitive.

           Only the letters "A" to "Z" are counted for the comparison; numbers, punctuation and any other symbol will be ignored. This means that the letter-sequence features are of no use at all, and should not be used, if either
           or both of the corpora is in a language that does not use the Latin alphabet.
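
To make the A-to-Z restriction concrete, here is a minimal Python sketch of letter-sequence extraction as the manfile describes it (case-insensitive, diacritic-insensitive, Latin letters only). It is an illustration, not the code in cwb-align; the function names `letter_ngrams` and `shared_ngrams` are made up for this example.

```python
import string
import unicodedata


def letter_ngrams(word, n):
    """Return the set of n-letter sequences in word, using only A-Z.

    Hypothetical helper: mimics the behaviour described in the manfile,
    not the actual cwb-align implementation.
    """
    # NFD splits accented letters into base letter + combining mark,
    # so dropping everything outside a-z/A-Z also drops the diacritics.
    decomposed = unicodedata.normalize("NFD", word)
    letters = "".join(ch.upper() for ch in decomposed if ch in string.ascii_letters)
    return {letters[i:i + n] for i in range(len(letters) - n + 1)}


def shared_ngrams(source_word, target_word, n):
    """N-grams that occur in both words (the candidate shared features)."""
    return letter_ngrams(source_word, n) & letter_ngrams(target_word, n)


if __name__ == "__main__":
    # Latin-script cognates share several 4-grams.
    print(shared_ngrams("Université", "Universität", 4))
    # Greek and Cyrillic words contain no A-Z letters, so they yield nothing.
    print(letter_ngrams("λόγος", 3), letter_ngrams("слово", 3))
```

Under these assumptions a Greek or Cyrillic word produces an empty set, so the letter-sequence features can never fire for such a pair.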


...



       The default configuration (if no flags are specified) is "-C:1 -S:50:0.4 -3:3 -4:4".

By not specifying a configuration, you are therefore asking the aligner to use 3-grams and 4-grams (the "-3:3 -4:4" part of the default).
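
As a small illustration, the following Python sketch takes the default configuration string quoted from the manfile and drops the letter-sequence flags, leaving a configuration you could pass explicitly. The helper name `without_letter_sequences` is hypothetical, and where exactly the configuration flags go on the cwb-align command line should be checked against the manfile.

```python
# Default configuration as quoted from the cwb-align manfile.
DEFAULT_CONFIG = "-C:1 -S:50:0.4 -3:3 -4:4"

# The four flags that enable letter-sequence (n-gram) features.
LETTER_SEQUENCE_FLAGS = ("-1", "-2", "-3", "-4")


def without_letter_sequences(config):
    """Return config with all "-1" to "-4" flags removed.

    Hypothetical helper for illustration only; it just edits the string.
    """
    kept = [
        flag
        for flag in config.split()
        # Each flag looks like "<name>:<weight>..."; compare the name part.
        if flag.split(":", 1)[0] not in LETTER_SEQUENCE_FLAGS
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(without_letter_sequences(DEFAULT_CONFIG))
```

The remaining flags are the non-n-gram part of the default, so specifying them explicitly should keep the other features as they were.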

However, although that will work around your immediate problem, it doesn't fix the bug.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
Sent: 10 March 2015 14:51
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] unicode problems with Greek and OCS

So is the mistake non-fatal, or could one make it non-fatal? As you pointed out, this is a hopeless task.
Ruprecht

On 10.03.2015 at 15:46, Hardie, Andrew wrote:
> The n-grams are for spotting corresponding words. As explained in the manfile, the program is designed for pairs like French-German, where the alphabet is the same and there is at least a smattering of cognate words which will be similar if not identical.
>
> For Cyrillic vs Greek the n-grams buy you nothing.
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] 
> On Behalf Of Ruprecht von Waldenfels
> Sent: 10 March 2015 14:43
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] unicode problems with Greek and OCS
>
> All corpora are encoded as UTF-8. This looks really strange. I tried different Unicode normalizations, namely NFKD, NFC and NFD, but all to no avail.
>
> What are the ngrams for? There is no word alignment, so it's all about the alignment anchors - shouldn't they be independent of the character set?
> Best,
> Ruprecht
>
>
> On 10.03.2015 at 14:09, Stefan Evert wrote:
>> One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding.  cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
>>
>> At first I thought that this might be due to the fact that the character n-gram features are in fact n-grams of bytes (so they cut out invalid UTF-8 sequences), but only the full strings are passed to cl_string_canonical().  See lines 287, 296, 509 and 527 in utils/feature_maps.c.
>>
>> Cheers,
>> Stefan
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


