[CWB] Exporting an aligned corpus
Stefan Evert
stefanML at COLLOCATIONS.DE
Wed Nov 10 13:57:01 CET 2010
> I would like to process a parallel corpus programmatically. I can do this in two different ways:
>
> - using some kind of C/Perl API that exports a cursor or iterator, which lets me look at one "sentence" pair at a time,
That's what I do in this situation (unless, of course, you have the original source data as sentence pairs and can use them directly).
It's very easy to do with the CWB/Perl API:
==============================
#!/usr/bin/perl
use strict;
use warnings;
use CWB::CL::Strict; # so we don't have to check for errors

my $Source = "EUROPARL-EN";
my $Target = "EUROPARL-FR";
my $Cs = new CWB::CL::Corpus $Source;
my $Ct = new CWB::CL::Corpus $Target;

# the alignment attribute of the source corpus is named after the target corpus (in lowercase)
my $Align = $Cs->attribute(lc($Target), "a");
my $n = $Align->max_alg; # number of alignment beads

my $Ws = $Cs->attribute("word", "p"); # so we can print something
my $Wt = $Ct->attribute("word", "p");

foreach my $i (0 .. $n-1) {
    # corpus positions of the aligned source and target regions
    my ($s1, $s2, $t1, $t2) = $Align->alg2cpos($i);
    my @source_text = $Ws->cpos2str($s1 .. $s2);
    my @target_text = $Wt->cpos2str($t1 .. $t2);
    print "SOURCE: @source_text\n";
    print "TARGET: @target_text\n";
    print "\n";
}
==============================
> - or perform a textual dump of the aligned corpora to any suitable format, and deal with that format in my application.
There's no built-in tool for exporting aligned text from CWB-encoded corpora, so you'd have to use something like the Perl script above in the first place.
In that case I prefer to do everything directly in CWB/Perl, since that gives you easy access to all annotation layers in the corpora.
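If you do want a textual dump, the loop in the script above can be adapted to write one line per alignment bead. What follows is only a rough sketch: the tab-separated layout (bead number, source tokens, target tokens) is an assumption chosen for illustration and not a format defined by CWB, and the corpus names are the same examples as above.
==============================
#!/usr/bin/perl
use strict;
use warnings;
use CWB::CL::Strict; # abort on CL errors, as in the script above

my ($Source, $Target) = ("EUROPARL-EN", "EUROPARL-FR"); # example corpus names
my $Cs = new CWB::CL::Corpus $Source;
my $Ct = new CWB::CL::Corpus $Target;
my $Align = $Cs->attribute(lc($Target), "a");
my $Ws = $Cs->attribute("word", "p");
my $Wt = $Ct->attribute("word", "p");

foreach my $i (0 .. $Align->max_alg() - 1) {
    my ($s1, $s2, $t1, $t2) = $Align->alg2cpos($i);
    my $source_text = join(" ", $Ws->cpos2str($s1 .. $s2));
    my $target_text = join(" ", $Wt->cpos2str($t1 .. $t2));
    # assumed layout: bead number <TAB> source tokens <TAB> target tokens
    print join("\t", $i, $source_text, $target_text), "\n";
}
==============================
Redirect the output to a file and split each line on tabs in your application. To include other annotation layers, open further p-attributes (e.g. "pos" or "lemma", if your corpora have them) in the same way as "word".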
Best,
Stefan