[CWB] sentence-Aligned parallel corpus in CWB
Stefan Evert
stefanML at collocations.de
Sun Feb 21 14:43:08 CET 2016
> On 18 Feb 2016, at 19:00, Serge Sharoff <s.sharoff at leeds.ac.uk> wrote:
>
> After that I run a script with the parameters corresponding to the names of
> these two corpora
>
>> echo >>/usr/local/share/cwb/registry/$1 ALIGNED $2
>> cwb-align -V seg -o $1-$2.align $1 $2 seg 1>$1-$2.log
>> cwb-align-encode -D $1-$2.align 1>>$1-$2.log
>> echo >>/usr/local/share/cwb/registry/$2 ALIGNED $1
>> cwb-align -V seg -o $2-$1.align $2 $1 seg 1>$2-$1.log
>> cwb-align-encode -D $2-$1.align 1>>$2-$1.log
cwb-align-import allows you to do the same thing without going through a "fake" cwb-align call. If you have encoded your corpora in this format, the alignment input file to be imported would look like this:
CORPUS1 CORPUS2 seg {id}
10000001 10000001
10000002 10000002
…
where all fields are delimited by TAB stops. The main purpose of cwb-align-import, of course, is to allow you to have different IDs in the source and target corpus (where beads are then identified by the pairings of IDs in the alignment input file) and to encode alignments that aren't 1:1 directly.
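As a rough illustration, an input file in the layout shown above could be generated with a few lines of shell. This is only a sketch: the helper name make_align_input and the output file name CORPUS1-CORPUS2.txt are made up for the example, and the ID pairs are the ones from the sample above. It only writes the TAB-delimited file; the file would then be handed to cwb-align-import (check the tool's own help text for the exact invocation and options).

```shell
# Hypothetical helper (not part of CWB): writes an alignment input file
# in the TAB-delimited layout described above.
# Usage: make_align_input SOURCE_CORPUS TARGET_CORPUS ATTRIBUTE < id_pairs
make_align_input() {
    src=$1; tgt=$2; att=$3
    # Header line: source corpus, target corpus, attribute, key pattern
    printf '%s\t%s\t%s\t%s\n' "$src" "$tgt" "$att" '{id}'
    # One bead per line: source ID, TAB, target ID
    while read -r src_id tgt_id; do
        printf '%s\t%s\n' "$src_id" "$tgt_id"
    done
}

# Whitespace-separated ID pairs on stdin, TAB-delimited file on stdout
printf '10000001 10000001\n10000002 10000002\n' \
    | make_align_input CORPUS1 CORPUS2 seg > CORPUS1-CORPUS2.txt
```

Using printf with an explicit \t avoids the usual problem of editors silently turning TABs into spaces.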
> On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> Thanks a lot. Is it possible to nest segments inside each other like in the following example?
>
> <seg id="1000001">
> <seg id="1000002">
> It
> </seg>
> <seg id="1000003">
> rains
> </seg>
> <seg id="1000004">
> .
> </seg>
> </seg>
>
> To model sentence and word alignment at the same time?
Unfortunately, that's not possible. CWB's alignment attributes were designed exclusively for sentence alignment. Like structural attributes, they don't allow nesting, and there can only be a single alignment between a given pair of corpora.
Everything will be much better in CWB4, of course, when it finally arrives.
Best,
Stefan