[CWB] Exporting an aligned corpus

Wed Nov 10 13:57:01 CET 2010

> I would like to process a parallel corpus programmatically. I can do this in two different ways:
> 
> - using some kind of C/Perl API that exports a cursor or iterator, which lets me look at one "sentence" pair at a time,

That's what I do in this situation (unless, of course, you have the original source data as sentence pairs and can use them directly).

It's very easy to do with the CWB/Perl API:

==============================
#!/usr/bin/perl
use strict;
use warnings;

use CWB::CL::Strict; # so we don't have to check for errors

my $Source = "EUROPARL-EN";
my $Target = "EUROPARL-FR";

my $Cs = CWB::CL::Corpus->new($Source);
my $Ct = CWB::CL::Corpus->new($Target);

# the alignment attribute lives in the source corpus, named after the target
my $Align = $Cs->attribute(lc($Target), "a");
my $n = $Align->max_alg; # number of alignment beads

my $Ws = $Cs->attribute("word", "p"); # so we can print something
my $Wt = $Ct->attribute("word", "p");

foreach my $i (0 .. $n-1) {
  # first and last corpus position of the aligned source and target regions
  my ($s1, $s2, $t1, $t2) = $Align->alg2cpos($i);
  my @source_text = $Ws->cpos2str($s1 .. $s2);
  my @target_text = $Wt->cpos2str($t1 .. $t2);
  print "SOURCE: @source_text\n";
  print "TARGET: @target_text\n";
  print "\n";
}
==============================

> - or perform a textual dump of the aligned corpora to any suitable format, and deal with that format in my application.

There's no built-in tool for exporting aligned text from CWB-encoded corpora, so you'd have to write the dump with something like the Perl script above anyway.
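
If a text dump is what you need, the script above could be adapted roughly as follows. This is only a sketch under a few assumptions: the output format (one region pair per line, source and target separated by a TAB) is just one possible choice, and the helper names tsv_line and dump_alignment are made up for illustration. CWB::CL is loaded inside dump_alignment so that the formatting helper can be tried out without a corpus installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join one aligned region pair into a single tab-separated line.
# Tabs and line breaks inside a token would break the format,
# so they are replaced by a blank.
sub tsv_line {
  my ($source_tokens, $target_tokens) = @_;
  my @fields;
  foreach my $tokens ($source_tokens, $target_tokens) {
    my $text = join(" ", @$tokens);
    $text =~ s/[\t\r\n]+/ /g;
    push @fields, $text;
  }
  return join("\t", @fields);
}

# Write every alignment bead of $source -> $target to the file handle $out.
sub dump_alignment {
  my ($source, $target, $out) = @_;
  require CWB::CL;
  CWB::CL::strict(1); # die on errors instead of checking return values

  my $Cs = CWB::CL::Corpus->new($source);
  my $Ct = CWB::CL::Corpus->new($target);
  my $Align = $Cs->attribute(lc($target), "a");
  my $Ws = $Cs->attribute("word", "p");
  my $Wt = $Ct->attribute("word", "p");

  foreach my $i (0 .. $Align->max_alg - 1) {
    my ($s1, $s2, $t1, $t2) = $Align->alg2cpos($i);
    my $line = tsv_line([$Ws->cpos2str($s1 .. $s2)],
                        [$Wt->cpos2str($t1 .. $t2)]);
    print {$out} $line, "\n";
  }
}

# Only runs when called with two corpus names on the command line.
dump_alignment(@ARGV, \*STDOUT) if @ARGV == 2;
```

Saved as, say, dump-aligned.pl (a hypothetical file name), it would be called like this:

  perl dump-aligned.pl EUROPARL-EN EUROPARL-FR > europarl.tsv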

That's why I prefer to do everything directly in CWB/Perl: you have easy access to all annotation layers in the corpora.
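
To illustrate what access to other annotation layers might look like, here is a small sketch that prints the source side of each pair as word/tag tokens. It assumes that the source corpus has a second positional attribute (e.g. one called "pos"; your corpus may name it differently or not have it at all). The helper names annotate_tokens and print_annotated_source are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pair each word with its annotation value: ("the", "DT") becomes "the/DT".
sub annotate_tokens {
  my ($words, $tags) = @_;
  return map { join("/", $words->[$_], $tags->[$_]) } 0 .. $#{$words};
}

# Print the source region of every alignment bead with annotation $att.
sub print_annotated_source {
  my ($source, $target, $att) = @_;
  require CWB::CL;
  CWB::CL::strict(1); # die on errors, e.g. if the attribute cannot be accessed

  my $Cs = CWB::CL::Corpus->new($source);
  my $Align = $Cs->attribute(lc($target), "a");
  my $Word = $Cs->attribute("word", "p");
  my $Tag = $Cs->attribute($att, "p"); # assumed to exist in your corpus

  foreach my $i (0 .. $Align->max_alg - 1) {
    my ($s1, $s2) = $Align->alg2cpos($i); # source region only
    my @cpos = ($s1 .. $s2);
    my @tokens = annotate_tokens([$Word->cpos2str(@cpos)],
                                 [$Tag->cpos2str(@cpos)]);
    print "SOURCE: @tokens\n";
  }
}

# Only runs when called as: script SOURCE-CORPUS TARGET-CORPUS ATTRIBUTE
print_annotated_source(@ARGV) if @ARGV == 3;
```

The same pattern should work for the target side and for any other positional attribute, since every attribute is looked up by corpus position.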

Best,
Stefan