[CWB] Does cwb-align-encode support utf8?

Thu Jul 12 15:20:03 CEST 2012

Hmmm. Ray’s initial problem was solved, but there does seem to be an underlying problem – arising from the English charset defaulting to Latin1 when unspecified, and the aligned Chinese data thus being treated as Latin1 for output.  (Though I don’t understand why the Chinese data, once output, wasn’t treated as UTF8 by the terminal...)
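To illustrate the kind of mismatch described above, here is a small Python sketch of what happens when UTF-8 encoded Chinese is processed as if it were Latin1. It only shows the general mechanism (every byte becomes its own "character"); it does not try to reproduce the exact escaped output from Ray's CQP session, whose cause is still unclear. The helper name `as_latin1` is made up for this illustration.

```python
def as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as if they were Latin1."""
    return text.encode("utf-8").decode("latin-1")


chinese = "我 是 一个 兵 。"
misread = as_latin1(chinese)

# Each Chinese character occupies three bytes in UTF-8, so the misread
# string is longer than the original: one pseudo-character per byte.
print(len(chinese), len(misread))

# The bytes themselves are untouched, so the damage is reversible as long
# as nothing rewrites them in between.
restored = misread.encode("latin-1").decode("utf-8")
print(restored == chinese)
```

The point is that the data on disk is not corrupted by the wrong charset label; only its interpretation at output time is affected.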

There is clearly some kind of issue here; however, I’m not quite sure what the answer is.

  * Clearly, it would be advantageous to allow alignment to be declared between two corpora that are in different charsets.
  * However, that creates problems for display, since...
  * ...it’s equally clearly undesirable for CQP to be outputting two charsets in the same chunk of output.
  * Possible solutions:
      - If the aligned chunk comes from a corpus with a different charset from the main corpus, recode it (using iconv or something).
      - If the aligned chunk comes from a corpus with a different charset from the main corpus, print an error message.
      - If the aligned chunk comes from a corpus with a different charset from the main corpus, print its position in the corpus.
      - Disallow alignment between corpora with different charsets.
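The first option (recoding the aligned chunk) might look roughly like this in Python. This is only a sketch of the idea, not CQP's actual code: `recode_chunk` and its parameters are hypothetical names, and Python's built-in codecs stand in for iconv.

```python
def recode_chunk(data: bytes, src_charset: str, dst_charset: str) -> bytes:
    """Recode an aligned chunk from its own charset to the main corpus charset.

    Characters that do not exist in the target charset are replaced with '?'
    instead of raising an error.
    """
    if src_charset.lower() == dst_charset.lower():
        return data  # same charset: nothing to do
    text = data.decode(src_charset)
    return text.encode(dst_charset, errors="replace")


# Latin1 chunk shown alongside a UTF-8 main corpus: lossless.
latin1_chunk = "café".encode("latin-1")
print(recode_chunk(latin1_chunk, "latin-1", "utf-8"))

# UTF-8 Chinese chunk shown alongside a Latin1 main corpus: lossy,
# because Latin1 has no Chinese characters at all.
chinese_chunk = "我 是".encode("utf-8")
print(recode_chunk(chinese_chunk, "utf-8", "latin-1"))
```

The second call shows the limit of this approach: recoding into UTF-8 always works, but recoding Chinese into Latin1 can only produce placeholders, so for that direction one of the other options would still be needed.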

Thoughts, everyone?

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ray Wu
Sent: 10 July 2012 03:54
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Does cwb-align-encode support utf8?

Ah, I switched BOTH "en" and "cn" to utf8 and it works now. The moral seems to be: always use utf8 when dealing with CJK scripts.
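
A quick way to spot this situation is to compare the charset declared by each registry file. The Python sketch below assumes the charset is recorded as a corpus property line of the form `##:: charset = "utf8"`, and that a missing declaration means Latin1, as described above. `registry_charset` and `same_charset` are hypothetical helper names, not part of CWB, and the registry snippets are shortened for illustration.

```python
import re

# Assumed form of the charset property line in a registry file.
CHARSET_RE = re.compile(r'^##::\s*charset\s*=\s*"?([A-Za-z0-9_-]+)"?')


def registry_charset(registry_text: str, default: str = "latin1") -> str:
    """Return the declared charset, or the default if none is declared."""
    for line in registry_text.splitlines():
        match = CHARSET_RE.match(line.strip())
        if match:
            return match.group(1).lower()
    return default


def same_charset(registry_a: str, registry_b: str) -> bool:
    """True if both registry entries resolve to the same charset."""
    return registry_charset(registry_a) == registry_charset(registry_b)


cn_registry = 'NAME "cn"\nID   cn\n##:: charset = "utf8"\n'
en_registry = 'NAME "en"\nID   en\n'  # no charset declared

print(registry_charset(cn_registry))
print(registry_charset(en_registry))
print(same_charset(cn_registry, en_registry))
```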
--
Best,
Ray

At 2012-07-10 09:29:56,"Ray Wu" <liangpingwu at 126.com> wrote:

Hi all,
Does anyone know whether cwb-align-encode supports utf8 now? I'm running into a problem when trying to align English with Chinese (encoded in utf8). It seems that I can query Chinese-English pairs but not the reverse: the English-Chinese pairs do not display the Chinese properly.

Ok, here is my scenario for your reference.

I have two toy corpora named "cn" and "en" respectively.
======================================
cn:
<a_cn_en id="cn_en_1">
我
是
一个
兵
。
</a_cn_en>
------------------------------------------------
en:
<a_cn_en id="cn_en_1">
I
am
a
soldier
.
</a_cn_en>

registry files:
-----------------------------------------------
cn:
NAME "cn"
ID   cn
HOME /home/ray/bilingual/cn
ATTRIBUTE word
STRUCTURE a_cn_en
----------------------------------------------
en:
NAME "en"
ID   en
HOME /home/ray/bilingual/en
ATTRIBUTE word
STRUCTURE a_cn_en
=====================================

I ran the following commands step by step:
ray@ray-desktop:~$ export CORPUS_REGISTRY=/home/ray/bilingual/registry
ray@ray-desktop:~$ cwb-encode -c utf8 -d /home/ray/bilingual/cn -f /home/ray/bilingual/data/cn -R /home/ray/bilingual/registry/cn -S a_cn_en
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/cn, line #1, warning issued only once).

ray@ray-desktop:~$ cwb-encode -d /home/ray/bilingual/en -f /home/ray/bilingual/data/en -R /home/ray/bilingual/registry/en -S a_cn_en
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/en, line #1, warning issued only once).

ray@ray-desktop:~$ cwb-make -V EN
ray@ray-desktop:~$ cwb-make -V CN
ray@ray-desktop:~$ cwb-align -v -o out.align  CN EN a_cn_en
ray@ray-desktop:~$ cwb-align-show out.align
Displaying alignment for [CN, EN] from file out.align
Enter 'h' for help.
>> p
1:1-alignment [0, 4] x [0, 4] (12)============================================

我 是 一个 兵 。                  I am a soldier .
>>

ray@ray-desktop:~$ cwb-align -v -o out2.align  EN CN a_cn_en
ray@ray-desktop:~$ cwb-align-show out2.align
Displaying alignment for [EN, CN] from file out2.align
Enter 'h' for help.
>>
1:1-alignment [0, 4] x [0, 4] (12)============================================

I am a soldier .                        我 是 一个 兵 。
>>

I added the following line in /home/ray/bilingual/registry/en:
ALIGNED    cn
Similarly, in /home/ray/bilingual/registry/cn I added:
ALIGNED    en

ray@ray-desktop:~$ cwb-align-encode -D out.align
ray@ray-desktop:~$ cwb-align-encode -D out2.align

ray@ray-desktop:~$ cqp
[no corpus]> CN;
CN> show +en;
CN> "我";
        0:                           <我> 是 一个 兵 。
-->en: I am a soldier .
CN>

ray@ray-desktop:~$ cqp
[no corpus]> EN;
EN> show +cn;
EN> "I";
        0:                           <I> am a soldier .
-->cn: <88><91> <98>� <80>个 <85>� <80><82>
EN>

As you can see, the English-Chinese alignment doesn't yield proper Chinese.

My question is: is this a cwb-align-encode problem or a cqp problem? Thanks for any tips.

Best,
Ray
