[CWB] Exporting an aligned corpus

Alberto Simões ambs at di.uminho.pt
Wed Nov 10 15:07:35 CET 2010


Hello

Thank you for the Perl code snippet. It will be very useful ;)
And yes, I also prefer to process the corpus directly from CWB rather than 
exporting it :)

Cheers

On 10/11/2010 12:57, Stefan Evert wrote:
>> I would like to process a parallel corpus programmatically. I can do this in two different ways:
>>
>> - using some kind of C/Perl API that exposes a cursor or iterator, which lets me look at one "sentence" pair at a time,
>
> That's what I do in this situation (unless, of course, you have the original source data as sentence pairs and can use them directly).
>
> It's very easy to do with the CWB/Perl API:
>
> ==============================
> #!/usr/bin/perl
>
> use CWB::CL::Strict; # so we don't have to check for errors
>
> $Source = "EUROPARL-EN";
> $Target = "EUROPARL-FR";
>
> $Cs = new CWB::CL::Corpus $Source;
> $Ct = new CWB::CL::Corpus $Target;
>
> $Align = $Cs->attribute(lc($Target), "a");
> $n = $Align->max_alg;
>
> $Ws = $Cs->attribute("word", "p"); # so we can print something
> $Wt = $Ct->attribute("word", "p");
>
> foreach $i (0 .. $n-1) {
>    ($s1, $s2, $t1, $t2) = $Align->alg2cpos($i);
>    @source_text = $Ws->cpos2str($s1 .. $s2);
>    @target_text = $Wt->cpos2str($t1 .. $t2);
>    print "SOURCE: @source_text\n";
>    print "TARGET: @target_text\n";
>    print "\n";
> }
> ==============================
>
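The loop above can also be wrapped into the cursor/iterator style asked about in the first option. What follows is only a minimal sketch: make_pair_iterator is a hypothetical helper name, not part of CWB::CL, and it assumes nothing beyond the three methods already used in the script (max_alg, alg2cpos, cpos2str).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CWB::CL): returns a closure that yields
# one aligned pair per call, and an empty list once all pairs are consumed.
#   $align   object providing max_alg() and alg2cpos($i)
#   $ws/$wt  objects providing cpos2str(@cpos) for source and target
sub make_pair_iterator {
    my ($align, $ws, $wt) = @_;
    my $n = $align->max_alg;    # number of alignment beads
    my $i = 0;                  # index of the next bead to return
    return sub {
        return if $i >= $n;     # exhausted: empty list ends a while loop
        my ($s1, $s2, $t1, $t2) = $align->alg2cpos($i++);
        return ( [ $ws->cpos2str($s1 .. $s2) ],
                 [ $wt->cpos2str($t1 .. $t2) ] );
    };
}
```

With the $Align, $Ws and $Wt objects from the script, usage would look like `my $next = make_pair_iterator($Align, $Ws, $Wt); while (my ($src, $tgt) = $next->()) { ... }`, where $src and $tgt are array references holding the tokens of each side.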
>> - or perform a textual dump of the aligned corpora to any suitable format, and deal with that format in my application.
>
>
> There's no built-in tool for exporting aligned text from CWB-encoded corpora, so you'd have to use something like the Perl script above in the first place.
>
> I prefer to do everything directly with CWB/Perl then, since you have easy access to all annotation layers in the corpora.
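If a textual dump is wanted anyway, the print statements in the script could be swapped for a one-pair-per-line formatter. The sketch below assumes a tab-separated layout (ID, source, target), which is just one "suitable format" picked for illustration. tsv_field and format_pair are hypothetical names, and the sample tokens are hand-made rather than taken from any corpus.

```perl
use strict;
use warnings;

# Join tokens with single spaces and replace tabs/newlines inside tokens,
# so that each pair is guaranteed to stay on one tab-separated line.
sub tsv_field {
    my @tokens = @_;
    my $text = join ' ', @tokens;
    $text =~ s/[\t\r\n]+/ /g;
    return $text;
}

# Format one sentence pair as "ID<TAB>source<TAB>target\n".
# $source and $target are array references holding the tokens.
sub format_pair {
    my ($id, $source, $target) = @_;
    return join("\t", $id, tsv_field(@$source), tsv_field(@$target)) . "\n";
}

# Hand-made example data; in a real run the token lists would be the
# results of the cpos2str calls for each alignment bead.
print format_pair(0, [qw(Hello world)], [qw(Bonjour le monde)]);
```

Inside the foreach loop this would amount to `print format_pair($i, \@source_text, \@target_text);`. Other annotation layers could be added as further columns in the same way.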
>
> Best,
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Alberto Simões

