[CWB] sentence-Aligned parallel corpus in CWB
Serge Sharoff
s.sharoff at leeds.ac.uk
Thu Feb 18 19:00:19 CET 2016
Hi Marc,
In the absence of an official tutorial, I process parallel corpora in CWB by
creating two corpora, one per language, each with a structural attribute (seg
below) whose ids are shared between aligned segments:
<seg id="1000001">
Resumption NN resumption
of IN of
the DT the
session NN session
</seg>
<seg id="1000002">
I PP I
declare VVP declare
resumed VVD resume
the DT the
session NN session
...
<seg id="1000001">
Wiederaufnahme NN Wiederaufnahme
der ART d
Sitzungsperiode NN Sitzungsperiode
</seg>
<seg id="1000002">
Ich PPER ich
erkläre VVFIN erklären
die ART d
am APPRART am
Freitag NN Freitag
...
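Since the alignment relies on the two files carrying the same seg ids, it can be worth checking that before encoding. The following is only a minimal sketch, assuming the one-token-per-line input shown above, with each <seg id="..."> start tag on a line of its own. The function names seg_ids and same_seg_ids are made up for illustration and are not part of CWB.

```shell
#!/bin/sh
# Hypothetical helpers, not part of CWB.

# Print the id of every <seg id="..."> start tag in a file, one per line.
seg_ids() {
    sed -n 's/^<seg id="\([^"]*\)">.*$/\1/p' "$1"
}

# Succeed only if both files contain the same seg ids in the same order.
same_seg_ids() {
    a=$(seg_ids "$1")
    b=$(seg_ids "$2")
    [ "$a" = "$b" ]
}
```

With two input files named, say, en.vrt and de.vrt (placeholder names), running same_seg_ids en.vrt de.vrt || echo "seg ids differ" would flag a mismatch before any alignment is attempted.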
Once the two corpora have been encoded, I run a script whose two parameters
($1 and $2) are the names of these corpora:
> # $1 and $2 are the two corpus names; the alignment is built in both directions
> echo "ALIGNED $2" >>"/usr/local/share/cwb/registry/$1"
> cwb-align -V seg -o "$1-$2.align" "$1" "$2" seg 1>"$1-$2.log"
> cwb-align-encode -D "$1-$2.align" 1>>"$1-$2.log"
> echo "ALIGNED $1" >>"/usr/local/share/cwb/registry/$2"
> cwb-align -V seg -o "$2-$1.align" "$2" "$1" seg 1>"$2-$1.log"
> cwb-align-encode -D "$2-$1.align" 1>>"$2-$1.log"
I hope this helps.
Best,
Serge
On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> Dear all,
>
> I am trying to convert a parallel sentence-aligned novel corpus to CWB. I
> have already compiled single-language corpora, but I have trouble finding
> information about the input format, conversion and querying in standard CQP
> for parallel corpora.
>
> Since there is EuroParl in CWB, there seems to be a way.
>
> I'd be grateful for any advice on how to proceed.
>
> Marc Reznicek
> DAAD
> Prof. Visitante Lector
> Facultad de Filología - Edificio D
> Departamento de Filología Alemana
> Universidad Complutense de Madrid
>
> Planta 2 ° D-342
> Av. Complutense
> Ciudad Universitaria s/n
> 28040 Madrid
> Tel: +34 91 394 7723