[CWB] Does cwb-align-encode support utf8?
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu Jul 12 15:20:03 CEST 2012
Hmmm. Ray’s initial problem was solved, but there does seem to be an underlying problem – arising from the English charset defaulting to Latin1 when unspecified, and the aligned Chinese data thus being treated as Latin1 for output. (Though I don’t understand why the Chinese data, once output, wasn’t treated as UTF8 by the terminal...)
There is some kind of issue here; however, I’m not quite sure what the answer is.
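To illustrate what "treated as Latin1" means for UTF8 data, here is a minimal Python sketch. It is not CWB code, the variable names are made up for illustration, and it does not try to reproduce CQP's exact output. It only shows that reading UTF8 bytes as Latin1 never fails (every byte maps to some Latin1 character), so the bytes themselves stay intact and only their interpretation goes wrong.

```python
# Illustration only: plain Python, no CWB involved.
text = "我 是 一个 兵 。"

# The bytes as they would be stored for a UTF-8 corpus.
raw = text.encode("utf-8")

# Interpreting those bytes as Latin-1 always "succeeds":
# each byte becomes exactly one (wrong) character.
as_latin1 = raw.decode("latin-1")

# Because nothing was lost, reversing the misinterpretation
# recovers the original string.
roundtrip = as_latin1.encode("latin-1").decode("utf-8")

print(len(text), len(raw), len(as_latin1))
print(roundtrip == text)
```

If this is what happens, the data on disk is fine and the damage occurs only at output time.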
· Clearly, it would be advantageous to allow alignment to be declared between two corpora that are in different charsets.
· However, that creates problems for display, since...
· ...it’s equally clearly undesirable for CQP to be outputting two charsets in the same chunk of output.
· Possible solutions:
    o If the aligned chunk comes from a corpus with a different charset from the main corpus, recode it
        § (using iconv or something)
    o If the aligned chunk comes from a corpus with a different charset from the main corpus, print an error message
    o If the aligned chunk comes from a corpus with a different charset from the main corpus, print its position in the corpus
    o Disallow alignment between corpora with different charsets.
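As a rough sketch of how the first option could behave, with the third as a fallback, here is some Python. It is not how CQP works or would have to work: the function name recode_aligned, its parameters, and the placeholder text are all hypothetical, and Python's codecs stand in for iconv.

```python
def recode_aligned(chunk: bytes, aligned_charset: str,
                   main_charset: str, position: int) -> bytes:
    """Recode an aligned chunk into the main corpus's charset.

    Hypothetical helper, not part of CWB. If the text cannot be
    represented in the main charset, fall back to reporting the
    corpus position instead of emitting a second charset.
    """
    text = chunk.decode(aligned_charset)
    try:
        return text.encode(main_charset)
    except UnicodeEncodeError:
        # ASCII-only placeholder, safe in any ASCII-compatible charset.
        placeholder = f"[aligned region at position {position}]"
        return placeholder.encode(main_charset)


# Latin-1 text shown inside a UTF-8 corpus: recoding works.
print(recode_aligned("soldier".encode("latin-1"), "latin-1", "utf-8", 0))

# UTF-8 Chinese shown inside a Latin-1 corpus: not representable,
# so the position placeholder is printed instead.
print(recode_aligned("我 是 一个 兵 。".encode("utf-8"), "utf-8", "latin-1", 0))
```

The second call shows the catch with recoding: it only helps when the main corpus's charset can represent the aligned text, which Latin1 cannot do for Chinese.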
Thoughts, everyone?
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ray Wu
Sent: 10 July 2012 03:54
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Does cwb-align-encode support utf8?
Ah, I switched BOTH "en" and "cn" to utf8 and it works now. The moral seems to be: always use utf8 when dealing with CJK scripts.
--
Best,
Ray
At 2012-07-10 09:29:56,"Ray Wu" <liangpingwu at 126.com> wrote:
Hi all,
Does anyone know whether cwb-align-encode supports utf8 now? I'm running into a problem when trying to align English with Chinese (encoded in utf8). It seems that I can query Chinese-English pairs but not the reverse. The English-Chinese pairs do not give proper Chinese.
Ok, here is my scenario for your reference.
I have two toy corpora named "cn" and "en" respectively.
======================================
cn:
<a_cn_en id="cn_en_1">
我
是
一个
兵
。
</a_cn_en>
------------------------------------------------
en:
<a_cn_en id="cn_en_1">
I
am
a
soldier
.
</a_cn_en>
registry files:
-----------------------------------------------
cn:
NAME "cn"
ID cn
HOME /home/ray/bilingual/cn
ATTRIBUTE word
STRUCTURE a_cn_en
----------------------------------------------
en:
NAME "en"
ID en
HOME /home/ray/bilingual/en
ATTRIBUTE word
STRUCTURE a_cn_en
=====================================
I ran the following, step by step:
ray at ray-desktop:~$ export CORPUS_REGISTRY=/home/ray/bilingual/registry
ray at ray-desktop:~$ cwb-encode -c utf8 -d /home/ray/bilingual/cn -f /home/ray/bilingual/data/cn -R /home/ray/bilingual/registry/cn -S a_cn_en
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/cn, line #1, warning issued only once).
ray at ray-desktop:~$ cwb-encode -d /home/ray/bilingual/en -f /home/ray/bilingual/data/en -R /home/ray/bilingual/registry/en -S a_cn_en
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/en, line #1, warning issued only once).
ray at ray-desktop:~$ cwb-make -V EN
ray at ray-desktop:~$ cwb-make -V CN
ray at ray-desktop:~$ cwb-align -v -o out.align CN EN a_cn_en
ray at ray-desktop:~$ cwb-align-show out.align
Displaying alignment for [CN, EN] from file out.align
Enter 'h' for help.
>> p
1:1-alignment [0, 4] x [0, 4] (12)============================================
我 是 一个 兵 。 I am a soldier .
>>
ray at ray-desktop:~$ cwb-align -v -o out2.align EN CN a_cn_en
ray at ray-desktop:~$ cwb-align-show out2.align
Displaying alignment for [EN, CN] from file out2.align
Enter 'h' for help.
>>
1:1-alignment [0, 4] x [0, 4] (12)============================================
I am a soldier . 我 是 一个 兵 。
>>
I added the following line in /home/ray/bilingual/registry/en:
ALIGNED cn
Similarly, in /home/ray/bilingual/registry/cn I added:
ALIGNED en
ray at ray-desktop:~$ cwb-align-encode -D out.align
ray at ray-desktop:~$ cwb-align-encode -D out2.align
ray at ray-desktop:~$ cqp
[no corpus]> CN;
CN> show +en;
CN> "我";
0: <我> 是 一个 兵 。
-->en: I am a soldier .
CN>
ray at ray-desktop:~$ cqp
[no corpus]> EN;
EN> show +cn;
EN> "I";
0: <I> am a soldier .
-->cn: <88><91> <98>� <80>个 <85>� <80><82>
EN>
As you can see, the English-Chinese alignment doesn't yield proper Chinese.
My question is: is this a cwb-align-encode problem or a cqp problem? Thanks for any tips.
Best,
Ray