[CWB] sentence-Aligned parallel corpus in CWB
Marc Reznicek
mreznice at ucm.es
Sun Feb 21 00:00:59 CET 2016
Hi Serge,
Thanks a lot. Is it possible to nest segments inside each other, as in the following example?
<seg id="1000001">
<seg id="1000002">
It
</seg>
<seg id="1000003">
rains
</seg>
<seg id="1000004">
.
</seg>
</seg>
The aim would be to model sentence and word alignment at the same time.
Best,
Marc
Marc Reznicek
DAAD
Prof. Visitante Lector
Facultad de Filología - Edificio D
Departamento de Filología Alemana
Universidad Complutense de Madrid
Planta 2 ° D-342
Av. Complutense
Ciudad Universitaria s/n
28040 Madrid
Tel: +34 91 394 7723
-----Original Message-----
From: Serge Sharoff [mailto:s.sharoff at leeds.ac.uk]
Sent: Thursday, 18 February 2016 19:00
To: cwb at sslmit.unibo.it
Cc: Marc Reznicek <mreznice at ucm.es>
Subject: Re: [CWB] sentence-Aligned parallel corpus in CWB
Hi Marc,
In the absence of an official tutorial, I process parallel corpora in CWB by creating two corpora, one per language, that carry a special structural attribute with shared ids:
<seg id="1000001">
Resumption NN resumption
of IN of
the DT the
session NN session
</seg>
<seg id="1000002">
I PP I
declare VVP declare
resumed VVD resume
the DT the
session NN session
...
<seg id="1000001">
Wiederaufnahme NN Wiederaufnahme
der ART d
Sitzungsperiode NN Sitzungsperiode
</seg>
<seg id="1000002">
Ich PPER ich
erkläre VVFIN erklären
die ART d
am APPRART am
Freitag NN Freitag
...
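As an illustration only (this is not Serge's own tooling), here is a minimal Python sketch of how two such input files with shared seg ids could be generated from sentence-aligned data. The function name to_vertical, the starting id 1000001 and the toy sentences (shortened from the example above) are assumptions made for this sketch. It also assumes tab-separated columns, which is what cwb-encode expects by default.

```python
from typing import Iterable, List, Tuple

# One token is (word form, part-of-speech tag, lemma).
Token = Tuple[str, str, str]


def to_vertical(sentences: Iterable[List[Token]], first_id: int = 1000001) -> str:
    """Render sentences as vertical text, one <seg id="..."> region per sentence.

    Hypothetical helper: the n-th sentence of every language gets the same id,
    so aligned sentences can be matched by the value of the seg attribute.
    """
    lines = []
    for offset, tokens in enumerate(sentences):
        lines.append(f'<seg id="{first_id + offset}">')
        for word, pos, lemma in tokens:
            # cwb-encode reads one token per line with tab-separated columns.
            lines.append(f"{word}\t{pos}\t{lemma}")
        lines.append("</seg>")
    return "\n".join(lines) + "\n"


# Toy data: the sentences at the same list position are translations of each other.
english = [
    [("Resumption", "NN", "resumption"), ("of", "IN", "of"),
     ("the", "DT", "the"), ("session", "NN", "session")],
    [("I", "PP", "I"), ("declare", "VVP", "declare")],
]
german = [
    [("Wiederaufnahme", "NN", "Wiederaufnahme"), ("der", "ART", "d"),
     ("Sitzungsperiode", "NN", "Sitzungsperiode")],
    [("Ich", "PPER", "ich"), ("erkläre", "VVFIN", "erklären")],
]

en_vrt = to_vertical(english)
de_vrt = to_vertical(german)
print(en_vrt)
print(de_vrt)
```

Each string would then be written to its own file and encoded as a separate corpus with seg declared as a structural attribute that has an id annotation.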
After that I run a script whose two parameters ($1 and $2) are the names of these two corpora:

# declare $2 as aligned corpus in the registry entry of $1, then align on seg and encode
echo >>/usr/local/share/cwb/registry/$1 ALIGNED $2
cwb-align -V seg -o $1-$2.align $1 $2 seg 1>$1-$2.log
cwb-align-encode -D $1-$2.align 1>>$1-$2.log
# the same steps in the opposite direction
echo >>/usr/local/share/cwb/registry/$2 ALIGNED $1
cwb-align -V seg -o $2-$1.align $2 $1 seg 1>$2-$1.log
cwb-align-encode -D $2-$1.align 1>>$2-$1.log
I hope this helps.
Best,
Serge
On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> Dear all,
>
> I am trying to convert a parallel sentence-aligned novel corpus to
> CWB. I have already compiled single language corpora but I have
> trouble finding information about the input format, conversion and
> querying in standard CQP concerning parallel corpora.
>
> Since there is EuroParl in CWB, there seems to be a way.
>
> I'd be grateful for any advice on how to proceed.