[CWB] sentence-Aligned parallel corpus in CWB

Marc Reznicek mreznice at ucm.es
Sun Feb 21 00:00:59 CET 2016


Hi Serge,

thanks a lot. Is it possible to nest segments inside each other, as in the following example?

<seg id="1000001">
	<seg id="1000002">
		It
	</seg>
	<seg id="1000003">
		rains
	</seg>
	<seg id="1000004">
		.
	</seg>
</seg>

The aim would be to model sentence and word alignment at the same time.
Best,

Marc

Marc Reznicek
DAAD
Prof. Visitante Lector
Facultad de Filología - Edificio D
Departamento de Filología Alemana
Universidad Complutense de Madrid

Planta 2 ° D-342
Av. Complutense
Ciudad Universitaria s/n
28040 Madrid
Tel: +34 91 394 7723

-----Original Message-----
From: Serge Sharoff [mailto:s.sharoff at leeds.ac.uk]
Sent: Thursday, 18 February 2016 19:00
To: cwb at sslmit.unibo.it
Cc: Marc Reznicek <mreznice at ucm.es>
Subject: Re: [CWB] sentence-Aligned parallel corpus in CWB

Hi Marc,

In the absence of an official tutorial, the way I process parallel corpora in CWB is to create two corpora, one per language, each with a special structural attribute (here seg) whose ids are shared between them:
<seg id="1000001">
Resumption      NN      resumption
of      IN      of
the     DT      the
session NN      session
</seg>
<seg id="1000002">
I       PP      I
declare VVP     declare
resumed VVD     resume
the     DT      the
session NN      session
...

<seg id="1000001">
Wiederaufnahme  NN      Wiederaufnahme
der     ART     d
Sitzungsperiode NN      Sitzungsperiode
</seg>
<seg id="1000002">
Ich     PPER    ich
erkläre VVFIN   erklären
die     ART     d
am      APPRART am
Freitag NN      Freitag
...
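
Since this approach relies on the seg ids matching across the two corpora, it can be worth comparing them before encoding. The following is only a minimal sketch: the function names seg_ids and check_seg_ids are hypothetical, and it assumes that every segment opens with a tag of the form <seg id="..."> on a line of its own, as in the examples above, and that the ids should appear in the same order in both files.

```shell
#!/bin/sh
# Print the id of every <seg id="..."> start tag in a vertical file, one per line.
# (seg_ids is a hypothetical helper name, not part of CWB.)
seg_ids() {
    sed -n 's/^[[:space:]]*<seg id="\([^"]*\)">[[:space:]]*$/\1/p' "$1"
}

# Succeed only if both files list the same seg ids in the same order.
# On a mismatch, diff prints the ids that differ.
check_seg_ids() {
    src_ids=$(mktemp) || return 2
    tgt_ids=$(mktemp) || return 2
    seg_ids "$1" >"$src_ids"
    seg_ids "$2" >"$tgt_ids"
    diff "$src_ids" "$tgt_ids"
    status=$?
    rm -f "$src_ids" "$tgt_ids"
    return $status
}
```

With two input files called, say, en.vrt and de.vrt (assumed names), the check could be run as: check_seg_ids en.vrt de.vrt && echo "seg ids match"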

After that I run a script whose two parameters, $1 and $2, are the names of these two corpora:

echo >>/usr/local/share/cwb/registry/$1 ALIGNED $2
cwb-align -V seg -o $1-$2.align $1 $2 seg 1>$1-$2.log
cwb-align-encode -D $1-$2.align 1>>$1-$2.log
echo >>/usr/local/share/cwb/registry/$2 ALIGNED $1
cwb-align -V seg -o $2-$1.align $2 $1 seg 1>$2-$1.log
cwb-align-encode -D $2-$1.align 1>>$2-$1.log
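
Because the two directions repeat the same three steps with the corpus names swapped, the script above could also be written as a pair of functions. This is only a sketch of that idea, not a tested replacement: the names align_one_way and align_pair and the REGISTRY variable are hypothetical, the argument check and the early exit on failure are additions, and the commands themselves are taken unchanged from the script above.

```shell
#!/bin/sh
# Registry directory as used in the script above; override it via the
# environment if your installation differs.
REGISTRY=${REGISTRY:-/usr/local/share/cwb/registry}

# Align corpus $1 (source) to corpus $2 (target) on the seg attribute.
align_one_way() {
    src=$1
    tgt=$2
    # Declare the alignment in the registry file of the source corpus.
    echo "ALIGNED $tgt" >>"$REGISTRY/$src" || return 1
    # Compute the alignment, then encode it; both steps log to the same file.
    cwb-align -V seg -o "$src-$tgt.align" "$src" "$tgt" seg >"$src-$tgt.log" || return 1
    cwb-align-encode -D "$src-$tgt.align" >>"$src-$tgt.log" || return 1
}

# Run both directions, as the script above does.
align_pair() {
    if [ "$#" -ne 2 ]; then
        echo "usage: align_pair CORPUS1 CORPUS2" >&2
        return 2
    fi
    align_one_way "$1" "$2" && align_one_way "$2" "$1"
}
```

It would be called with the two corpus names, in the same way as the original script: align_pair $1 $2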

I hope this helps.

Best,
Serge

On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> Dear all,
>
> I am trying to convert a parallel sentence-aligned novel corpus to
> CWB. I have already compiled single-language corpora, but I have
> trouble finding information about the input format, conversion and
> querying in standard CQP for parallel corpora.
>
> Since there is EuroParl in CWB, there seems to be a way.
>
> I'd be grateful for any advice on how to proceed.


