[CWB] Sentence-aligned parallel corpus in CWB

Serge Sharoff s.sharoff at leeds.ac.uk
Sun Feb 21 17:17:33 CET 2016


Thanks, Stefan, for pushing us towards using cwb-align-import.  My script is 
roughly a decade old.  Also, my parallel corpora come from pre-aligned 
files (such as Europarl or TMX), whereas cwb-align-import is more general.

Serge

On Sunday 21 Feb 2016 14:43:08 Stefan Evert wrote:
> > On 18 Feb 2016, at 19:00, Serge Sharoff <s.sharoff at leeds.ac.uk> wrote:
> > 
> > After that I run a script with the parameters corresponding to the names
> > of these two corpora
> > 
> >> echo >>/usr/local/share/cwb/registry/$1 ALIGNED $2
> >> cwb-align -V seg -o $1-$2.align $1 $2 seg 1>$1-$2.log
> >> cwb-align-encode -D $1-$2.align 1>>$1-$2.log
> >> echo >>/usr/local/share/cwb/registry/$2 ALIGNED $1
> >> cwb-align -V seg -o $2-$1.align $2 $1 seg 1>$2-$1.log
> >> cwb-align-encode -D $2-$1.align 1>>$2-$1.log
> 
> cwb-align-import allows you to do the same thing without going through a
> "fake" cwb-align call.  If you have encoded your corpora in this format,
> the alignment input file to be imported would look like this
> 
> CORPUS1	CORPUS2	seg	{id}
> 10000001	10000001
> 10000002	10000002
> 
> where all fields are delimited by TAB stops.  The main purpose of
> cwb-align-import, of course, is to allow you to have different IDs in the
> source and target corpus (where beads are then identified by the pairings
> of IDs in the alignment input file) and to encode alignments that aren't
> 1:1 directly.
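
The case of different IDs in the source and target corpus can be sketched as a small shell helper that writes an alignment input file in the layout shown above: a TAB-delimited header line followed by one TAB-delimited ID pair per bead. This is only an illustration under stated assumptions: the function name write_alignment_file and the example IDs are hypothetical, the header simply reuses the "seg" and "{id}" fields from the example above, and only 1:1 beads are covered. How the resulting file is then passed to cwb-align-import is not shown here.

```shell
#!/bin/sh
# Hypothetical helper (not part of CWB): reads whitespace-separated
# "source_id target_id" pairs on stdin and prints an alignment input file
# in the TAB-delimited layout quoted above.
#   $1 = source corpus name, $2 = target corpus name, $3 = grid attribute
write_alignment_file() {
    # Header line: SOURCE <TAB> TARGET <TAB> attribute <TAB> {id}
    printf '%s\t%s\t%s\t{id}\n' "$1" "$2" "$3"
    # One bead per input line; the second test handles a final line
    # that lacks a trailing newline.
    while read -r src tgt || [ -n "$src" ]; do
        [ -n "$src" ] || continue          # skip empty lines
        printf '%s\t%s\n' "$src" "$tgt"
    done
}

# Example with made-up IDs that differ between the two corpora.
printf '10000001 20000017\n10000002 20000018\n' |
    write_alignment_file CORPUS1 CORPUS2 seg
```

Redirecting the output to a file (for example with "> CORPUS1-CORPUS2.txt") would give an input file whose beads are identified purely by the ID pairings, which is the situation described above.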
> 
> > On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> > Thanks a lot. Is it possible to nest segments inside each other, as in the
> > following example?
> > 
> > <seg id="1000001">
> > 
> > 	<seg id="1000002">
> > 	
> > 		It
> > 	
> > 	</seg>
> > 	<seg id="1000003">
> > 	
> > 		rains
> > 	
> > 	</seg>
> > 	<seg id="1000004">
> > 	
> > 		.
> > 	
> > 	</seg>
> > 
> > </seg>
> > 
> > To model sentence and word alignment at the same time?
> 
> Unfortunately, that's not possible.  CWB's alignment attributes were
> designed exclusively for sentence alignment.  Like structural attributes,
> they don't allow nesting, and there can only be a single alignment between
> a given pair of corpora.
> 
> Everything will be much better in CWB4, of course, when it finally arrives.
> 
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

