[CWB] unicode problems with Greek and OCS

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Tue Mar 10 16:22:05 CET 2015


I have to say, solving this problem would be a very good start!
However, I don't understand how to NOT specify these parameters. I've 
tried setting their weights to 0, but that doesn't help.
Best!
Ruprecht



Am 10.03.2015 um 15:56 schrieb Hardie, Andrew:
> Well you could always tell the aligner not to use n-grams as a feature!
>
> From the manfile:
>
>         -1:<weight>
>             Specifies that the appearance of shared one-letter sequences within words in the two possibly-equivalent regions should be used as features for the similarity measurement, with the specified weight.
>
>             The configuration flags "-1, -2, -3, -4" all specify the use of letter sequences as features, and they all work in the same way; the following general comments apply to all four of these flags.
>
>             Sub-word letter-sequence matching allows the presence of similar but not identical words to count as a factor in similarity. Such words are often orthographic cognates that are likely to be translation equivalents and
>             thus evidence that the pair of regions under analysis really are equivalent.  The longer the letter sequence, the more impressive the evidence (so you would normally weight "-4" more heavily than "-3", and so on; the
>             default configuration (see below) does not include "-1" and "-2" at all).
>
>             When letter sequences are compared, the comparison is case-insensitive and diacritic-insensitive.
>
>             Only the letters "A" to "Z" are counted for the comparison; numbers, punctuation and any other symbol will be ignored. This means that the letter-sequence features are of no use at all, and should not be used, if either
>             or both of the corpora is in a language that does not use the Latin alphabet.
>
>
> ...
>
>
>
>         The default configuration (if no flags are specified) is "-C:1 -S:50:0.4 -3:3 -4:4".
>
> By not specifying a configuration you are, ergo, asking the aligner to use 3grams and 4grams.
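A minimal sketch of what "specifying a configuration" could look like in practice, assuming (as the manfile excerpt implies) that any explicitly given flags replace the default set entirely. The corpus names, the grid attribute and the argument order of the command line are illustrative assumptions and should be checked against the cwb-align manfile; only the flag strings themselves are taken from the text quoted above.

```python
import shlex

# Default configuration, as quoted from the manfile.
DEFAULT_CONFIG = ["-C:1", "-S:50:0.4", "-3:3", "-4:4"]

# The four letter-sequence (n-gram) flags named in the manfile.
NGRAM_PREFIXES = ("-1:", "-2:", "-3:", "-4:")


def without_ngrams(config):
    """Return the configuration flags with all letter-sequence features removed."""
    return [flag for flag in config if not flag.startswith(NGRAM_PREFIXES)]


def build_command(source, target, grid, config):
    """Assemble a cwb-align call; the argument order here is an assumption."""
    return ["cwb-align", source, target, grid, *config]


if __name__ == "__main__":
    flags = without_ngrams(DEFAULT_CONFIG)
    # Hypothetical corpus names and grid attribute, for illustration only.
    print(shlex.join(build_command("OCS", "GREEK", "s", flags)))
```

The point is that the n-gram features are switched off by leaving "-3" and "-4" out of an explicit configuration, not by giving them a weight of 0.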
>
> However, though that will solve your immediate problem, it doesn't solve the bug.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von Waldenfels
> Sent: 10 March 2015 14:51
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] unicode problems with Greek and OCS
>
> So is the mistake non-fatal, or could one make it non-fatal? As you pointed out, this is a hopeless task.
> Ruprecht
> Am 10.03.2015 um 15:46 schrieb Hardie, Andrew:
>> The n-grams are for spotting corresponding words. As explained in the manfile, the program is designed for pairs like French-German where the alphabet is the same and there is at least a smattering of cognate words which will be similar if not identical.
>>
>> For Cyrillic vs Greek the n-grams buy you nothing.
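To illustrate why the n-grams contribute nothing here, the following sketch mimics the behaviour the manfile describes (case-insensitive, diacritic-insensitive, letters "A" to "Z" only). It is not cwb-align's actual code; the function names are made up for this example, and the diacritic folding via Unicode decomposition is an assumption about how such folding might be done.

```python
import unicodedata


def latin_letters(word):
    """Fold case and strip diacritics, then keep only the letters A to Z."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed.upper() if "A" <= ch <= "Z")


def letter_ngrams(word, n):
    """Return the set of n-letter sequences found in the folded word."""
    letters = latin_letters(word)
    return {letters[i:i + n] for i in range(len(letters) - n + 1)}


if __name__ == "__main__":
    # French/German cognates share several trigrams ...
    print(letter_ngrams("Université", 3) & letter_ngrams("Universität", 3))
    # ... but Greek and Cyrillic words contribute no letters at all.
    print(letter_ngrams("λόγος", 3), letter_ngrams("слово", 3))
```

Under these assumptions a Greek or Cyrillic word is reduced to an empty string before any n-grams are formed, so no feature can ever be shared between the two regions.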
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
>> On Behalf Of Ruprecht von Waldenfels
>> Sent: 10 March 2015 14:43
>> To: cwb at sslmit.unibo.it
>> Subject: Re: [CWB] unicode problems with Greek and OCS
>>
>> All corpora are encoded as UTF-8. This looks really strange. I tried different Unicode normalization forms, namely NFKD, NFC and NFD, but all to no avail.
>>
>> What are the ngrams for? There is no word alignment, so it's all about the alignment anchors - shouldn't they be independent of the character set?
>> Best,
>> Ruprecht
>>
>>
>> Am 10.03.2015 um 14:09 schrieb Stefan Evert:
>>> One case in which this would happen is if the _source_ corpus is UTF-8, but the target corpus has some other encoding.  cwb-align obtains the encoding from the source corpus and doesn't bother to check it against the target corpus.
>>>
>>> At first I thought that this might be due to the fact that the character n-gram features are in fact n-grams of bytes (so they cut out invalid UTF-8 sequences), but only the full strings are passed to cl_string_canonical().  See lines 287, 296, 509 and 527 in utils/feature_maps.c.
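The effect considered (and then ruled out as the cause) in the paragraph above can be reproduced in isolation: taking n-grams over bytes rather than characters cuts multi-byte UTF-8 characters apart. This is only a sketch of that general effect, with made-up helper names, and says nothing about what utils/feature_maps.c actually does.

```python
def byte_ngrams(word, n):
    """Return every n-byte window of the word's UTF-8 encoding."""
    data = word.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def is_valid_utf8(chunk):
    """Report whether a byte string decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Every Greek letter takes two bytes in UTF-8, so a three-byte window
    # always splits a character somewhere.
    for chunk in byte_ngrams("λόγος", 3):
        print(chunk, is_valid_utf8(chunk))
```

For plain ASCII input every window is valid, which is why such a problem would only surface with non-Latin scripts.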
>>>
>>> Cheers,
>>> Stefan
>>> _______________________________________________
>>> CWB mailing list
>>> CWB at sslmit.unibo.it
>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


