[CWB] Output from cwb-align
Stefan Evert
stefanML at collocations.de
Tue Sep 28 10:39:13 CEST 2010
On 24 Sep 2010, at 11:16, Gabriele Brandolini wrote:
> I’ve tried to align bilingual corpora (Latin-Swahili) by using cwb-align and some very useful and clear instructions given by Stefan. Thanks to him!
>
> I got some good output, and I’m now checking it.
>
>
> Could someone please tell me whether it is possible, and if so how, to save the aligned file in a readable format, say txt, with each line holding one <source_aligned_sentence(s)> TAB (or some other separator) <target_aligned_sentence(s)> pair?
I'm afraid there isn't a ready-made tool that does exactly what you need. It would be relatively straightforward to write such a program in Perl, if you've got the CWB::CL Perl module installed.
Another option is to encode the alignment for use with CQP and then do something like the following in CQP -- e.g. for EUROPARL-EN aligned with EUROPARL-FR:
EUROPARL-EN;
Sents = <s> [] :EUROPARL-FR [];
show -cpos +europarl-fr;
set ld "";
set rd "";
set Context europarl-fr;
cat Sents > "alignment-pairs.txt";
This will print each alignment pair on two consecutive lines (instead of separated by a TAB), where the second line is always marked with "-->europarl-fr:". It should be easy to convert this file into the format you need. However, there will be some duplicates whenever multiple sentences in the source language form a single alignment block (there's no way around this at the moment, I'm afraid).
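For the conversion step, something along the following lines might do. This is only a minimal Python sketch, not a tested tool: it assumes that every source line is followed directly by one line starting with "-->europarl-fr:", that the file is UTF-8 encoded, and that the duplicates show up as identical pairs. The names pair_lines and convert are made up for illustration, and the marker has to be changed to match your own target corpus.

```python
import sys

# Assumed prefix of the aligned (target) lines in the CQP output;
# change it to match the name of your own target corpus.
MARKER = "-->europarl-fr:"


def pair_lines(lines, marker=MARKER):
    """Yield unique (source, target) pairs from alternating source/target lines."""
    source = None
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(marker):
            if source is None:
                continue  # target line without a preceding source line
            pair = (source, line[len(marker):].strip())
            if pair not in seen:  # drop exact duplicates of earlier pairs
                seen.add(pair)
                yield pair
            source = None
        else:
            source = line  # remember the source line until its target arrives


def convert(in_path, out_path, encoding="utf-8"):
    """Write one 'source<TAB>target' line per alignment pair."""
    with open(in_path, encoding=encoding) as src, \
            open(out_path, "w", encoding=encoding) as dst:
        for source, target in pair_lines(src):
            dst.write(f"{source}\t{target}\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        convert(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice (say pairs2tsv.py), it would be run as "python pairs2tsv.py alignment-pairs.txt alignment-pairs.tsv". Dropping identical pairs is just one guess at how to handle the duplicates; if they look different in your output, that part needs adjusting.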
Hope this helps a bit,
Stefan