[CWB] A question about the aligning using cwb-encoding

Ray Wu liangpingwu at 126.com
Sat Feb 15 10:05:06 CET 2014


Hi all,
Andrew is right. We made no modification to the code and simply used the translation-visualisation feature. It can be achieved like this:


Step 1: Prepare a CQPweb-compatible corpus file “test.txt” (encoded as UTF-8):
<text id="test">
<s trans="The original language">
The
translated
text
.
</s>
</text>

Step 2: When installing the new corpus, configure it by entering the following under “S-attributes (XML elements) -> Use custom setup”:
0+trans
(NB: Specify “P-attributes” as necessary if your corpus is different from mine.)
 
Step 3: When everything is done, go to “Manage visualisations -> Free translation -> Select XML element/attribute to get the translation from” and choose “s_trans” to provide a whole-sentence translation.

Although it works, it certainly lacks some features provided by cwb-align. For instance, it doesn't support the alignment of more than two languages. We are still looking for ways to address this issue.
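
For Step 1, here is a minimal sketch of how such a file could be generated from sentence pairs that are already aligned. It is not part of our actual setup: the function name make_vertical and the input layout (a list of token lists paired with translation strings) are hypothetical, and escaping &, < and > in tokens assumes the corpus is encoded in XML-aware mode.

```python
from xml.sax.saxutils import escape


def make_vertical(text_id, pairs):
    """Build a one-token-per-line corpus file in the Step 1 layout.

    pairs: iterable of (tokens, translation), where tokens is the list of
    word forms of one sentence and translation is the whole-sentence text
    to be stored in the trans attribute of <s>.
    """
    def attr(value):
        # Collapse whitespace so the start tag stays on a single line,
        # then escape &, <, > and double quotes inside the value.
        value = " ".join(value.split())
        return '"' + escape(value, {'"': "&quot;"}) + '"'

    lines = ["<text id=" + attr(text_id) + ">"]
    for tokens, translation in pairs:
        lines.append("<s trans=" + attr(translation) + ">")
        # Assumption: XML-aware encoding, so special characters in tokens
        # are written as entities.
        lines.extend(escape(tok) for tok in tokens)
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [(["The", "translated", "text", "."], "The original language")]
    with open("test.txt", "w", encoding="utf-8") as out:
        out.write(make_vertical("test", sample))
```

Run as a script, this writes a “test.txt” with the same content as the example in Step 1. Because each sentence carries only one trans attribute, the sketch has the same limitation as the approach itself: one translation per sentence.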

Best,
Ray


At 2014-02-14 04:41:09,"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:
It looks to me like they are using the translation-visualisation feature. This is really designed for interlinear field data, where you would have the original language as the word p-attribute, the morpheme gloss as the primary annotation p-attribute, and the free translation as an annotated s-attribute. However, I built it in such a way that you can turn on translations without glossing. I think that's what they've done, putting one corpus into the XML of the other. No reason why others shouldn't be able to use the same trick.


Worth noting once again that I never actually finished work on the advanced visualisations.


Best


Andrew.



"Josep M. Fontana" <josepm.fontana at upf.edu> wrote:




>>> Is it possible right now to use the CQPweb interface to exploit parallel corpora?
>>> The question is: is the future here already?
> No.
>
> This is still planned, but I have not had time to do it yet.

OK, so this means that the people who did this had to do quite a bit of
hacking:

http://124.193.83.252/cqp/

If you look, there are a few parallel corpora at the end of the list.
Access is now restricted, but I was able to get in earlier and it really
seemed to work well.

JM



>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> Sent: 13 February 2014 17:11
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] A question about the aligning using cwb-encoding
>
> I just found this old thread on alignment and this reminded me of something that I had wanted to ask for a while. Is it possible right now to use the CQPweb interface to exploit parallel corpora? We have parallel corpora from translations between different languages (so the alignment is already done) but these are using a very problematic and proprietary interface. We would like to move all of our corpora to the best web interface there is, CQPweb, of course :-)
>
> I found a paper written by Andrew
> (http://www.lancaster.ac.uk/people/hardiea/cqpweb-paper.pdf) where he talks about using CQPweb with parallel corpora but as something he was planning for the future: "Other planned extensions remain to be
> implemented: support for concordancing across parallel corpora;".
>
> The question is: is the future here already?
>
> JM
>>> The first few sentences were aligned as correct pairs.
>>> But the others were not.
>>> It seems to be related to the statistical alignment process.
>> You're absolutely right.  cwb-align isn't a particularly sophisticated sentence aligner, so it's likely to get some cases wrong.  You may be seeing particularly bad performance if you're using the default parameter settings, which are intended for related languages and are based on sentence length (in characters), character n-gram counts and identical words.
>>
>> For Korean-English alignment, the best solution might be to get a good bilingual word list and use that as the only feature (dropping even sentence length).
>>
>>> Actually, I built the two corpora so that the sentences in each pair have the same sentence id, like <s id="100"> or <s id="10000">, in order to avoid failures of the statistical alignment.
>>> I am working with 60000 sentences. I manually aligned all of them and put the information into the XML tag "s_id".
>>>
>>> My question is: how can I make use of the manually created XML tag "s_id"?
>> If these are only 1:1 alignments, you can use a trick to smuggle them past cwb-align:
>>
>>       cwb-align -V s_id -o alignment.txt CORPUS1 CORPUS2 s -C:1
>>
>> With "-V s_id", the manually aligned sentence pairs are taken as a pre-alignment, and the statistical aligner is only run within each pair of pre-aligned regions.  Since each of those contains just a single sentence pair, it cannot further break up the bead, so the original pre-alignment is passed through.  Feature specs shouldn't matter here, so you might as well just specify -C:1 to avoid unnecessary overhead.  You can then proceed to cwb-align-encode the generated file alignment.txt as usual.
>>
>> If you have more complex alignments (n:1 or 1:n, 2:2, ...), you could add new XML regions, e.g.
>>
>>       <bead id="100"> ... </bead>
>>
>> and use -V bead_id for the pre-alignment in cwb-align.
>>
>>
>> If you have a recent version of the CWB/Perl interface, the best strategy is to use the cwb-align-import tool.  You'll have to provide a separate alignment file that lists the sentence IDs in source and target corpus for each alignment bead.  Complex alignments require no special treatment with this tool.  See "perldoc cwb-align-import" for usage and format details.
>>
>>
>> Best,
>> Stefan
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
