[CWB] Output from cwb-align

Stefan Evert stefanML at collocations.de
Tue Sep 28 10:39:13 CEST 2010


On 24 Sep 2010, at 11:16, Gabriele Brandolini wrote:

> I’ve tried to align bilingual corpora (Latin-Swahili) by using cwb-align and some very useful and clear instructions given by Stefan. Thanks to him!
> 
> I got some good output, and I’m now checking it.
> 
>  
> Could someone please tell me whether it's possible, and how, to save the aligned file in a readable format such as plain text, with each line having the form <source_aligned_sentence(s)>TAB (or some other separator)<target_aligned_sentence(s)>?

I'm afraid there isn't a ready-made tool that does exactly what you need.  It would be relatively straightforward to write such a program in Perl, if you've got the CWB::CL Perl module installed.

Another option is to encode the alignment for use with CQP and then do something like the following in CQP -- e.g. for EUROPARL-EN aligned with EUROPARL-FR:

EUROPARL-EN;
Sents = <s> [] :EUROPARL-FR [];
show -cpos +europarl-fr;
set ld "";
set rd "";
set Context europarl-fr;
cat Sents > "alignment-pairs.txt";

This will print each alignment pair on two consecutive lines (instead of separated by a TAB), where the second line is always marked with "-->europarl-fr:".  It should be easy to convert this file into the format you need.  However, there will be some duplicates whenever multiple sentences in the source language form a single alignment block (there's no way around this at the moment, I'm afraid).
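
The conversion step could look roughly like the Python sketch below.  It is only an illustration under a few assumptions: every target line starts with the marker "-->europarl-fr:", every other non-empty line is a source line, and the file is UTF-8 (adjust the encoding to match your corpus).  The function names and the dedupe option are made up for this example; dropping a pair that is identical to the one just before it is one possible way to deal with the duplicates mentioned above, not something CQP does for you.

```python
import sys

# Assumed prefix of target-language lines in the CQP output.
MARKER = "-->europarl-fr:"


def pair_lines(lines, marker=MARKER, dedupe=True):
    """Turn alternating source/target lines into (source, target) pairs.

    A source line that is not followed by a marked target line is
    silently replaced by the next source line.  With dedupe=True, a pair
    identical to the immediately preceding pair is dropped.
    """
    pairs = []
    source = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(marker):
            if source is None:
                continue  # target line without a source line: skip it
            pair = (source, line[len(marker):].strip())
            if not (dedupe and pairs and pairs[-1] == pair):
                pairs.append(pair)
            source = None
        else:
            source = line
    return pairs


def main(in_path, out_path, encoding="utf-8"):
    # The encoding is an assumption; use whatever your corpus is stored in.
    with open(in_path, encoding=encoding) as src:
        pairs = pair_lines(src)
    with open(out_path, "w", encoding=encoding) as out:
        for source, target in pairs:
            out.write(f"{source}\t{target}\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice, say convert_pairs.py, it would be run as "python convert_pairs.py alignment-pairs.txt alignment-pairs.tsv".  For a different target corpus, the marker has to be changed to match.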

Hope this helps a bit,
Stefan



