[CWB] sentence-Aligned parallel corpus in CWB

Stefan Evert stefanML at collocations.de
Sun Feb 21 14:43:08 CET 2016


> On 18 Feb 2016, at 19:00, Serge Sharoff <s.sharoff at leeds.ac.uk> wrote:
> 
> After that I run a script with the parameters corresponding to the names of 
> these two corpora
> 
>> echo >>/usr/local/share/cwb/registry/$1 ALIGNED $2
>> cwb-align -V seg -o $1-$2.align $1 $2 seg 1>$1-$2.log
>> cwb-align-encode -D $1-$2.align 1>>$1-$2.log
>> echo >>/usr/local/share/cwb/registry/$2 ALIGNED $1
>> cwb-align -V seg -o $2-$1.align $2 $1 seg 1>$2-$1.log
>> cwb-align-encode -D $2-$1.align 1>>$2-$1.log

cwb-align-import lets you do the same thing without going through a "fake" cwb-align call.  If you have encoded your corpora in this format, the alignment input file to be imported would look like this:

CORPUS1	CORPUS2	seg	{id}
10000001	10000001
10000002	10000002
…

where all fields are separated by TAB characters.  The main purpose of cwb-align-import, of course, is to allow you to have different IDs in the source and target corpus (where beads are then identified by the pairings of IDs in the alignment input file) and to encode alignments that aren't 1:1 directly.
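
For the simple 1:1 case, such an input file could be generated with a few lines of shell.  This is only a rough sketch: the helper name write_alignment, the output file name and the ID range are made up for illustration, and it assumes that both corpora carry the same consecutive numeric id values on their <seg> regions.

```shell
#!/bin/sh
# Sketch: write an alignment input file for two corpora whose <seg> regions
# share identical id values (plain 1:1 alignment).
# write_alignment is a hypothetical helper, not part of CWB.
write_alignment() {
    src=$1; tgt=$2; first=$3; last=$4
    # Header line: source corpus, target corpus, s-attribute, key pattern
    printf '%s\t%s\tseg\t{id}\n' "$src" "$tgt"
    # One bead per line: source id TAB target id
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s\t%s\n' "$i" "$i"
        i=$((i + 1))
    done
}

# Example: three beads for CORPUS1 -> CORPUS2 (ID range chosen arbitrarily)
write_alignment CORPUS1 CORPUS2 10000001 10000003 > corpus1-corpus2.tsv
```

The resulting file is what you would then hand to cwb-align-import; check the tool's own usage message for the exact invocation and options.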

> On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> Thanks a lot. Is it possible to nest segments inside each other like in the following example?
> 
> <seg id="1000001">
> 	<seg id="1000002">
> 		It
> 	</seg>
> 	<seg id="1000003">
> 		rains
> 	</seg>
> 	<seg id="1000004">
> 		.
> 	</seg>
> </seg>
> 
> To model sentence and word alignment at the same time?


Unfortunately, that's not possible.  CWB's alignment attributes were designed exclusively for sentence alignment.  Like structural attributes, they don't allow nesting, and there can only be a single alignment between a given pair of corpora.

Everything will be much better in CWB4, of course, when it finally arrives.

Best,
Stefan

