[CWB] Short sentences inconsistent alignment
"Andrés Chandía"
andres at chandia.net
Thu Dec 27 17:08:55 CET 2018
I have aligned the corpus this way:
cwb-align -r registry/ -V s_id -o txtgmmdes_es.align txtgmmdes_es txtgmmdes_md s
And the other way around for its parallel corpus.
Adding -V s_id did the trick.
Reading this part helped me:
If we specify pre-alignment with -S, then the aligner assumes that the source and
target corpora have the same number of paragraphs, and that the first paragraph in
the source (HOLMES-EN) corresponds to the first paragraph in the target (HOLMES-DE),
the second to the second, and so on. This would be done as follows:
$ cwb-align -S p -o holmes.align HOLMES-EN HOLMES-DE s
Alternatively we can use -V. In this case, paragraphs will not be matched up by
order - they are matched up by the value of the s-attribute. Since the Holmes
corpora input data have num as an annotation, there is an s-attribute p_num which
has values and can be used in this way. This would be done as follows:
$ cwb-align -V p_num -o holmes.align HOLMES-EN HOLMES-DE s
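
The difference between the two pre-alignment modes quoted above can be pictured with a small Python sketch. It is a conceptual illustration only and does not reflect how cwb-align is implemented; the function names, the tuple layout (start cpos, end cpos, attribute value) and the sample regions are all made up for the example.

```python
# Conceptual illustration only: this is NOT how cwb-align is implemented.
# Regions are (start_cpos, end_cpos, value) tuples; all data below is made up.

def pair_by_order(source, target):
    """Idea behind -S: the i-th source region goes with the i-th target region."""
    if len(source) != len(target):
        raise ValueError("pairing by order needs the same number of regions on both sides")
    return [(s[:2], t[:2]) for s, t in zip(source, target)]


def pair_by_value(source, target):
    """Idea behind -V: regions go together when their annotation values are equal."""
    target_by_value = {t[2]: t for t in target}
    return [
        (s[:2], target_by_value[s[2]][:2])
        for s in source
        if s[2] in target_by_value
    ]


# Hypothetical regions: the region with value "2" is missing on the target side.
source = [(0, 9, "1"), (10, 24, "2"), (25, 40, "3")]
target = [(0, 11, "1"), (12, 30, "3")]

by_value = pair_by_value(source, target)
```

With these made-up regions, pairing by order is not possible because the two sides have different lengths, while pairing by value still links the regions whose values match. This would be consistent with -V s_id helping in a corpus where sentences carry matching ids on both sides.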
Thanks a lot!!!!
The .align file is read as described in man cwb-align.
In brief, cols 1-4 are two pairs of cpos, where the first cpos pair = region in source
and the second cpos pair = aligned region in target: so what I'm asking is, are the
example sentences you sent with id=73 correctly represented by a line of cpos pairs in
the a-attribute?
(You can also use cwb-align-decode to check that what is encoded is the same as what is
in your .align file.)
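
As a rough aid for that check, here is a minimal Python sketch that reads the first four columns of each line as two cpos pairs and looks up the target region for a given source region. It relies only on the column layout described above; the whitespace-separated fields, the skipping of non-numeric lines (such as a header), the function names and the sample lines are assumptions made for illustration, so man cwb-align remains the reference for the actual format.

```python
def read_align_lines(lines):
    """Yield ((src_start, src_end), (tgt_start, tgt_end)) from the first four columns."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            s1, s2, t1, t2 = (int(f) for f in fields[:4])
        except ValueError:
            continue  # assumed: a non-numeric line such as a header
        yield (s1, s2), (t1, t2)


def target_for(lines, src_start, src_end):
    """Return the target cpos pair aligned to the given source region, or None."""
    for src, tgt in read_align_lines(lines):
        if src == (src_start, src_end):
            return tgt
    return None


# Hypothetical sample lines; any columns after the fourth are ignored here.
sample = [
    "SRC\ts\tTGT\ts",
    "0\t9\t0\t11\t1:1\t50",
    "10\t24\t12\t30\t1:1\t42",
]

found = target_for(sample, 10, 24)
```

To use it on a real file, one would pass an open file object instead of `sample` and supply the cpos range of the sentence in question, which has to be looked up separately in the source corpus.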
If the cpos pairs are not correct for that sentence alignment, then the problem is in
the generation of the .align file. One point to note is that if you used cwb-align to
generate the alignments (??), errors are to be expected for language pairs which share
little or no vocab.
Best
Andrew.
From: "Andrés Chandía"
Sent: 27 December 2018 11:47
To: Hardie, Andrew
Cc: Open source development of the Corpus WorkBench
Subject: RE: [CWB] Short sentences inconsistent alignment
Thanks for the answer, but how do I check that these s elements are really aligned
with one another in the underlying a-attribute?
If you mean checking the align files, how should they be read? Anyway,
here they are (just in case):
Dungupeyem | IECMap | ISECMap | NMT | Corlexim
administrator of:
Parles.upf | IWCH | Amind terapia | ONG Mapuche koyaktu | Nocando | IAC | CddZ | ISAC | CatCg
Do not print unnecessarily. Take care of the environment!