[CWB] sentence-Aligned parallel corpus in CWB
Serge Sharoff
s.sharoff at leeds.ac.uk
Sun Feb 21 17:17:33 CET 2016
Thanks, Stefan, for pushing us towards cwb-align-import. My script is
roughly a decade old, and my parallel corpora come from pre-aligned files
(such as Europarl or TMX), whereas cwb-align-import is more general.
Serge
On Sunday 21 Feb 2016 14:43:08 Stefan Evert wrote:
> > On 18 Feb 2016, at 19:00, Serge Sharoff <s.sharoff at leeds.ac.uk> wrote:
> >
> > After that I run a script with the parameters corresponding to the names
> > of these two corpora:
> >
> >> echo >>/usr/local/share/cwb/registry/$1 ALIGNED $2
> >> cwb-align -V seg -o $1-$2.align $1 $2 seg 1>$1-$2.log
> >> cwb-align-encode -D $1-$2.align 1>>$1-$2.log
> >> echo >>/usr/local/share/cwb/registry/$2 ALIGNED $1
> >> cwb-align -V seg -o $2-$1.align $2 $1 seg 1>$2-$1.log
> >> cwb-align-encode -D $2-$1.align 1>>$2-$1.log
>
> cwb-align-import allows you to do the same thing without going through a
> "fake" cwb-align call. If you have encoded your corpora in this format,
> the alignment input file to be imported would look like this:
>
> CORPUS1 CORPUS2 seg {id}
> 10000001 10000001
> 10000002 10000002
> …
>
> where all fields are delimited by TAB stops. The main purpose of
> cwb-align-import, of course, is to allow you to have different IDs in the
> source and target corpus (where beads are then identified by the pairings
> of IDs in the alignment input file) and to encode alignments that aren't
> 1:1 directly.
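For illustration, here is a minimal sketch of how such an input file might be
generated from a list of ID pairs. The file name alignment-input.tsv and the
target IDs starting with 2 are made up for the example; the header fields are
copied from the example above. The result would then be passed to
cwb-align-import (check its own help text for the exact invocation, which is
not shown in this thread).

```shell
#!/bin/sh
# Sketch only: write a TAB-delimited alignment input file in the layout
# quoted above. Corpus names, IDs and the output file name are placeholders.
src=CORPUS1
tgt=CORPUS2
out=alignment-input.tsv   # hypothetical file name

# Header line: source corpus, target corpus, s-attribute, ID pattern.
printf '%s\t%s\t%s\t%s\n' "$src" "$tgt" seg '{id}' > "$out"

# One bead per line: source ID, TAB, target ID.
# The IDs on the two sides do not have to be identical.
while read -r src_id tgt_id; do
    printf '%s\t%s\n' "$src_id" "$tgt_id"
done >> "$out" <<'EOF'
10000001 20000001
10000002 20000002
EOF

cat "$out"
```

Using printf with explicit \t avoids editors silently turning the TABs into
spaces, which matters because the fields are delimited by TAB stops.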
> > On Thursday 18 February 2016 2:25:19 PM Marc Reznicek wrote:
> > Thanks a lot. Is it possible to nest segments inside each other, as in
> > the following example?
> >
> > <seg id="1000001">
> > <seg id="1000002">
> > It
> > </seg>
> > <seg id="1000003">
> > rains
> > </seg>
> > <seg id="1000004">
> > .
> > </seg>
> > </seg>
> >
> > To model sentence and word alignment at the same time?
>
> Unfortunately, that's not possible. CWB's alignment attributes were
> designed exclusively for sentence alignment. Like structural attributes,
> they don't allow nesting, and there can only be a single alignment between
> a given pair of corpora.
>
> Everything will be much better in CWB4, of course, when it finally arrives.
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb