[CWB] sentence-Aligned parallel corpus in CWB

Serge Sharoff s.sharoff at leeds.ac.uk
Thu Feb 18 19:00:19 CET 2016


Hi Marc,

In the absence of an official tutorial, I process parallel corpora in CWB by 
creating two corpora, one per language, that share a structural attribute 
(here <seg>) whose ids match across aligned segments (English first, then 
German):
<seg id="1000001">
Resumption      NN      resumption
of      IN      of
the     DT      the
session NN      session
</seg>
<seg id="1000002">
I       PP      I
declare VVP     declare
resumed VVD     resume
the     DT      the
session NN      session
...

<seg id="1000001">
Wiederaufnahme  NN      Wiederaufnahme
der     ART     d
Sitzungsperiode NN      Sitzungsperiode
</seg>
<seg id="1000002">
Ich     PPER    ich
erkläre VVFIN   erklären
die     ART     d
am      APPRART am
Freitag NN      Freitag
...
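
Since the alignment relies on the ids being shared, it may be worth checking 
the two input files before encoding. A minimal sketch, assuming the 
<seg id="..."> tags look exactly like the ones above; the function names 
seg_ids and same_seg_ids are made up for illustration, and the check (same 
ids in the same order) may be stricter than CWB actually needs:

```shell
# Print the seg ids found in a verticalised input file, one per line,
# in document order. Assumes tags of the exact form <seg id="...">.
seg_ids() {
    sed -n 's/^<seg id="\([^"]*\)">.*/\1/p' "$1"
}

# Succeed only if both files contain the same ids in the same order.
same_seg_ids() {
    a=$(seg_ids "$1")
    b=$(seg_ids "$2")
    [ "$a" = "$b" ]
}
```

For example, same_seg_ids en.vrt de.vrt || echo "ids differ" (the file names 
are placeholders).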

After that I run a script whose two parameters are the names of these two 
corpora; it computes and encodes the alignment in both directions:

# $1 and $2 are the names of the two corpora, as used for their registry files
echo "ALIGNED $2" >> "/usr/local/share/cwb/registry/$1"
cwb-align -V seg -o "$1-$2.align" "$1" "$2" seg > "$1-$2.log"
cwb-align-encode -D "$1-$2.align" >> "$1-$2.log"
# the same again in the opposite direction
echo "ALIGNED $1" >> "/usr/local/share/cwb/registry/$2"
cwb-align -V seg -o "$2-$1.align" "$2" "$1" seg > "$2-$1.log"
cwb-align-encode -D "$2-$1.align" >> "$2-$1.log"
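
The script assumes that both corpora have already been encoded and indexed. 
A rough sketch of that prerequisite step for one corpus, not necessarily the 
way the corpora above were built: the function names, the four arguments and 
the utf8 charset are assumptions, and the columns are taken to be word, pos 
and lemma as in the example. Declaring seg with -V is meant to give the seg 
attribute annotation values, so that it matches the "-V seg" passed to 
cwb-align; if it were declared as "-S seg:0+id" instead, the ids would 
presumably end up in an attribute called seg_id. Setting DRY_RUN=1 only 
prints the commands.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Encode and index one corpus.
# Hypothetical arguments: lowercase corpus name, input file,
# data directory, registry directory.
encode_corpus() {
    name=$1; vrt=$2; data=$3; registry=$4
    # cwb-makeall is given the corpus name in upper case
    upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
    run mkdir -p "$data"
    # assumed column layout: word (default), then pos and lemma
    run cwb-encode -c utf8 -d "$data" -f "$vrt" -R "$registry/$name" \
        -P pos -P lemma -V seg
    run cwb-makeall -r "$registry" -V "$upper"
}
```

For example, DRY_RUN=1 followed by 
encode_corpus en en.vrt /data/en /usr/local/share/cwb/registry shows the 
three commands without touching anything (the paths are placeholders).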

I hope this helps.

Best,
Serge

On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> Dear all,
> 
> I am trying to convert a parallel sentence-aligned novel corpus to CWB. I
> have already compiled single-language corpora but I have trouble finding
> information about the input format, conversion and querying in standard CQP
> concerning parallel corpora.
> 
> Since there is EuroParl in CWB, there seems to be a way.
> 
> I'd be grateful for any advice on how to proceed.
> 
> Marc Reznicek
> DAAD
> Prof. Visitante Lector
> Facultad de Filología - Edificio D
> Departamento de Filología Alemana
> Universidad Complutense de Madrid
> Planta 2 ° D-342
> Av. Complutense
> Ciudad Universitaria s/n
> 28040 Madrid
> Tel: +34 91 394 7723

