[CWB] Short sentences inconsistent alignment

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Dec 28 10:58:56 CET 2018


I'm glad you have a solution that works, but note you don't even need cwb-align if your data is already fully aligned by the sentence IDs... You can just use cwb-align-import instead.

Best

Andrew.

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of "Andrés Chandía"
Sent: 27 December 2018 16:09
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Short sentences inconsistent alignment

I have aligned the corpus this way:

cwb-align -r registry/ -V s_id -o txtgmmdes_es.align txtgmmdes_es txtgmmdes_md s

And the other way around for its parallel corpus.

Adding -V s_id did the trick.

Reading this part helped me:
If we specify pre-alignment with -S, then the aligner assumes that the source and target corpora have
the same number of paragraphs, and that the first paragraph in the source (HOLMES-EN) corresponds
to the first paragraph in the target (HOLMES-DE), the second to the second, and so on. This would be
done as follows:
$ cwb-align -S p -o holmes.align HOLMES-EN HOLMES-DE s
Alternatively we can use -V. In this case, paragraphs will not be matched up by order - they are
matched up by the value of the s-attribute. Since the Holmes corpora input data have num as an
annotation, there is an s-attribute p_num which has values and can be used in this way. This would be
done as follows:
$ cwb-align -V p_num -o holmes.align HOLMES-EN HOLMES-DE s
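
The difference between the two pre-alignment modes can be pictured with a toy example outside CWB. This is only an analogy, not how cwb-align is implemented: the function names and the one-region-per-line "<id> <text>" file layout are invented for illustration.

```shell
# Toy illustration only; this is NOT cwb-align's implementation.
# Assumption: each input file holds one region per line as "<id> <text>",
# sorted by id (join requires sorted input).

# Pair regions by position, in the spirit of order-based pre-alignment (-S).
pair_by_order() { paste "$1" "$2"; }

# Pair regions by shared id value, in the spirit of value-based pre-alignment (-V).
pair_by_value() { join "$1" "$2"; }
```

With source regions 1, 2, 3 and target regions 1, 3, pair_by_order puts source region 2 next to target region 3, whereas pair_by_value only pairs the regions whose ids match (1 and 3).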

Thanks a lot!!!!


On Thu, 27 December 2018, 13:58, Hardie, Andrew wrote:
The .align file is read as described in man cwb-align.
In brief, cols 1-4 are two pairs of cpos, where the first cpos pair = region in source and the second cpos pair = aligned region in target: so what I'm asking is, are the example sentences you sent with id=73 correctly represented by a line of cpos pairs in the a-attribute?
(You can also use cwb-align-decode to check that what is encoded is the same as what is in your .align file.)
If the cpos pairs are not correct for that sentence alignment, then the problem is in the generation of the .align file. One point to note is that if you used cwb-align to generate the alignments (??), errors are to be expected for language pairs which share little or no vocab.
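
As a rough way to run that check by hand, the sketch below looks up the line of a .align file whose source region contains a given corpus position. It relies only on the column layout described above (columns 1-4 read as source start, source end, target start, target end). The function name align_lookup is made up for illustration, and skipping any line whose first four columns are not all numeric (such as a header line) is an assumption about the file layout; man cwb-align remains the authority.

```shell
# Hypothetical helper: print the alignment line whose source region
# covers a given corpus position.
# Assumption: columns 1-4 are source start, source end, target start,
# target end; lines without four numeric leading columns are skipped.
align_lookup() {
  # $1 = path to the .align file, $2 = corpus position in the source corpus
  awk -v cpos="$2" '
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ &&
    cpos + 0 >= $1 + 0 && cpos + 0 <= $2 + 0 {
      printf "source %d-%d -> target %d-%d\n", $1, $2, $3, $4
    }
  ' "$1"
}
```

For example, align_lookup txtgmmdes_es.align 1234 (1234 being an arbitrary position) would print the source and target regions recorded for the sentence containing that position, which can then be compared with the actual cpos ranges of the two sentences in question.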
Best
Andrew.
From: "Andrés Chandía"
Sent: 27 December 2018 11:47
To: Hardie, Andrew
Cc: Open source development of the Corpus WorkBench
Subject: RE: [CWB] Short sentences inconsistent alignment
Thanks for the answer, but how do I check that these s elements are really aligned with one another in the underlying a-attribute?
If you mean checking the align files, how should they be read? Anyway, here they are (just in case):
[IMAGE REMOVED]
Dungupeyem<http://chandia.net/content/dungupeyem> | IECMap<http://chandia.net/content/iecmap> | ISECMap<http://chandia.net/content/isecmap> | NMT<http://chandia.net/content/nmt> | Corlexim<http://corlexim.cl>

administrator of:
Parles.upf<http://parles.upf.edu> | IWCH<https://iwch.upf.edu> | Amind terapia<http://amindterapia.com> | ONG Mapuche koyaktu<http://koyaktumapuche.net> | Nocando<http://parles.upf.edu/llocs/nocando> | IAC<https://iac.upf.edu> | CddZ<https://iac.upf.edu/cddz> | ISAC<https://iac.upf.edu/isac> | CatCg<http://catcg.upf.edu>
Do not print unnecessarily. Protect the environment!



_______________________
andrés chandía