[CWB] Does cwb-align-encode support utf8?

Ray Wu liangpingwu at 126.com
Tue Jul 10 03:29:56 CEST 2012


Hi all,
Does anyone know whether cwb-align-encode supports utf8 now? I'm running into a problem when trying to align English with Chinese (encoded in utf8). It seems that I can query Chinese-English pairs, but not the reverse: the English-Chinese pairs do not display proper Chinese.

OK, here is my setup for reference.

I have two toy corpora named "cn" and "en" respectively.
======================================
cn:
<a_cn_en id="cn_en_1">
我
是
一个
兵
。
</a_cn_en>
------------------------------------------------
en:
<a_cn_en id="cn_en_1">
I
am
a
soldier
.
</a_cn_en>


registry files:
-----------------------------------------------
cn:
NAME "cn"
ID   cn
HOME /home/ray/bilingual/cn
ATTRIBUTE word
STRUCTURE a_cn_en
----------------------------------------------
en:
NAME "en"
ID   en
HOME /home/ray/bilingual/en
ATTRIBUTE word
STRUCTURE a_cn_en
=====================================

I ran the following commands step by step:
ray@ray-desktop:~$ export CORPUS_REGISTRY=/home/ray/bilingual/registry
ray@ray-desktop:~$ cwb-encode -c utf8 -d /home/ray/bilingual/cn -f /home/ray/bilingual/data/cn -R /home/ray/bilingual/registry/cn -S a_cn_en
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/cn, line #1, warning issued only once).

ray@ray-desktop:~$ cwb-encode -d /home/ray/bilingual/en -f /home/ray/bilingual/data/en -R /home/ray/bilingual/registry/en -S a_cn_en
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/en, line #1, warning issued only once).

ray@ray-desktop:~$ cwb-make -V EN
ray@ray-desktop:~$ cwb-make -V CN
ray@ray-desktop:~$ cwb-align -v -o out.align  CN EN a_cn_en
ray@ray-desktop:~$ cwb-align-show out.align
Displaying alignment for [CN, EN] from file out.align
Enter 'h' for help.
>> p
1:1-alignment [0, 4] x [0, 4] (12)============================================

我 是 一个 兵 。                  I am a soldier .
>>

ray@ray-desktop:~$ cwb-align -v -o out2.align  EN CN a_cn_en
ray@ray-desktop:~$ cwb-align-show out2.align
Displaying alignment for [EN, CN] from file out2.align
Enter 'h' for help.
>>
1:1-alignment [0, 4] x [0, 4] (12)============================================

I am a soldier .                        我 是 一个 兵 。
>>


I added the following line to /home/ray/bilingual/registry/en:
ALIGNED    cn
Similarly, in /home/ray/bilingual/registry/cn I added:
ALIGNED    en

ray@ray-desktop:~$ cwb-align-encode -D out.align
ray@ray-desktop:~$ cwb-align-encode -D out2.align

ray@ray-desktop:~$ cqp
[no corpus]> CN;
CN> show +en;
CN> "我";
        0:                           <我> 是 一个 兵 。
-->en: I am a soldier .
CN>

ray@ray-desktop:~$ cqp
[no corpus]> EN;
EN> show +cn;
EN> "I";
        0:                           <I> am a soldier .
-->cn: <88><91> <98>� <80>个 <85>� <80><82>
EN>

As you can see, the English-Chinese alignment doesn't yield proper Chinese.
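
To make the garbling easier to read, the bytes of each Chinese token can be dumped and compared with the escaped output above. This is only a minimal sketch, assuming a POSIX shell with `od` and `xargs` available; `utf8_hex` is a hypothetical helper name and not part of CWB.

```shell
#!/bin/sh
# utf8_hex: hypothetical helper (not a CWB tool); prints the bytes of its
# argument as space-separated lowercase hex.
utf8_hex() {
    # od emits the hex bytes; xargs joins them with single spaces
    printf '%s' "$1" | od -An -tx1 | xargs
}

# Dump every token of the toy Chinese sentence.
# Assumes this script itself is saved as UTF-8.
for tok in 我 是 一个 兵 。; do
    printf '%s\t%s\n' "$tok" "$(utf8_hex "$tok")"
done
```

In UTF-8, 我 is the byte sequence e6 88 91, so the `<88><91>` in the CQP output looks like the last two bytes of that token with the leading byte lost or altered. That would be consistent with the data being treated as a non-UTF-8 character set somewhere along the way.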

My question is: is this a cwb-align-encode problem or a cqp problem? Thanks for any tips.
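
One thing that may be worth checking first is which character set each registry entry declares, since the `cwb-encode` call for `en` above was run without `-c utf8`. The sketch below assumes that the registry files written by `cwb-encode` contain a corpus property line of the form `##:: charset = "..."`; `registry_charset` is a hypothetical helper, not a CWB command.

```shell
#!/bin/sh
# registry_charset: hypothetical helper (not a CWB command); prints the
# value of the charset property found in a registry file.
# Assumes a property line such as:  ##:: charset  = "utf8"
registry_charset() {
    sed -n 's/^##::[[:space:]]*charset[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Registry directory taken from the CORPUS_REGISTRY setting used above.
for corpus in cn en; do
    file="/home/ray/bilingual/registry/$corpus"
    if [ -r "$file" ]; then
        printf '%s: %s\n' "$corpus" "$(registry_charset "$file")"
    else
        printf '%s: registry file not readable\n' "$corpus"
    fi
done
```

If the two entries report different values, that mismatch is a plausible suspect, though this has not been confirmed as the cause.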



Best,
Ray