[CWB] Short sentences inconsistent alignment

"Andrés Chandía" andres at chandia.net
Thu Dec 27 17:08:55 CET 2018




I have aligned the corpus this way:


cwb-align -r registry/ -V s_id -o txtgmmdes_es.align txtgmmdes_es txtgmmdes_md s


And the other way around for its parallel corpus.



Adding -V s_id did the trick.


Reading this part helped me:
If we specify pre-alignment with -S, then the aligner assumes that the
source and target corpora have the same number of paragraphs, and that the
first paragraph in the source (HOLMES-EN) corresponds to the first paragraph
in the target (HOLMES-DE), the second to the second, and so on. This would
be done as follows:

$ cwb-align -S p -o holmes.align HOLMES-EN HOLMES-DE s

Alternatively we can use -V. In this case, paragraphs will not be matched up
by order - they are matched up by the value of the s-attribute. Since the
Holmes corpora input data have num as an annotation, there is an s-attribute
p_num which has values and can be used in this way. This would be done as
follows:

$ cwb-align -V p_num -o holmes.align HOLMES-EN HOLMES-DE s
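
To make the difference between the two pre-alignment modes concrete, here is
a toy Python sketch. It is not CWB code and does not call cwb-align; it only
illustrates "match by order" (the idea behind -S) versus "match by attribute
value" (the idea behind -V). The region representation and the function names
pair_by_order and pair_by_value are hypothetical, and skipping unmatched
values is an assumption of this sketch, not a statement about how cwb-align
treats them.

```python
from typing import List, Tuple

# Hypothetical toy representation of a region: (attribute value, text).
Region = Tuple[str, str]


def pair_by_order(source: List[Region], target: List[Region]) -> List[Tuple[Region, Region]]:
    """Pair the i-th source region with the i-th target region (idea behind -S)."""
    if len(source) != len(target):
        # Order-based pairing only makes sense with equal region counts.
        raise ValueError("source and target must have the same number of regions")
    return list(zip(source, target))


def pair_by_value(source: List[Region], target: List[Region]) -> List[Tuple[Region, Region]]:
    """Pair regions that carry the same attribute value (idea behind -V)."""
    target_by_value = {value: (value, text) for value, text in target}
    # Assumption of this sketch: source regions without a matching value are skipped.
    return [
        ((value, text), target_by_value[value])
        for value, text in source
        if value in target_by_value
    ]
```

With value-based pairing, a sentence that is missing on one side no longer
shifts every later pair by one, which is consistent with -V s_id fixing the
misalignment described in this thread.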



Thanks a lot!!!!





The .align file is read as described in man cwb-align.


In brief, cols 1-4 are two pairs of cpos, where the first cpos pair = region
in source and the second cpos pair = aligned region in target: so what I’m
asking is, are the example sentences you sent with id=73 correctly
represented by a line of cpos pairs in the a-attribute?
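
As a rough aid for that check, the following Python sketch reads the first
four columns of each line of an .align file and looks up the pair that covers
a given source cpos. It relies only on what is said above (cols 1-4 are two
cpos pairs). The names AlignedPair, parse_align_lines and find_pair_for_cpos
are hypothetical, and two things are assumptions: that lines whose first four
fields are not integers (for example a header line) can be skipped, and that
region ends are inclusive. Check man cwb-align before relying on either.

```python
from typing import Iterable, List, NamedTuple, Optional


class AlignedPair(NamedTuple):
    src_start: int  # first cpos of the source region
    src_end: int    # last cpos of the source region (assumed inclusive)
    tgt_start: int  # first cpos of the aligned target region
    tgt_end: int    # last cpos of the aligned target region (assumed inclusive)


def parse_align_lines(lines: Iterable[str]) -> List[AlignedPair]:
    """Collect the cpos pairs found in columns 1-4 of each line."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or incomplete line
        try:
            numbers = [int(field) for field in fields[:4]]
        except ValueError:
            continue  # assumption: non-numeric lines (e.g. a header) are skipped
        pairs.append(AlignedPair(*numbers))
    return pairs


def find_pair_for_cpos(pairs: List[AlignedPair], cpos: int) -> Optional[AlignedPair]:
    """Return the pair whose source region contains cpos, or None."""
    for pair in pairs:
        if pair.src_start <= cpos <= pair.src_end:
            return pair
    return None
```

To use it, open the .align file, pass the file object to parse_align_lines,
and then call find_pair_for_cpos with the corpus position of the first token
of the sentence in question; the returned target range can be compared with
where the matching sentence really sits in the target corpus.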


(You can also use cwb-align-decode to check that what is encoded is the same
as what is in your .align file.)


If the cpos pairs are not correct for that sentence alignment, then the
problem is in the generation of the .align file. One point to note is that
if you used cwb-align to generate the alignments (??), errors are to be
expected for language pairs which share little or no vocab.


Best


Andrew.




From: "Andrés Chandía"
Sent: 27 December 2018 11:47
To: Hardie, Andrew
Cc: Open source development of the Corpus WorkBench
Subject: RE: [CWB] Short sentences inconsistent alignment
 

Thanks for the answer, but how do I check that these s elements are really
aligned with one another in the underlying a-attribute?

If you mean checking the align files, how should they be read? Anyway, here
they are (just in case):



    
        
            
            


_______________________
andrés chandía

Dungupeyem | IECMap | ISECMap | NMT | Corlexim

administrator of:
Parles.upf | IWCH | Amind terapia | ONG Mapuche koyaktu | Nocando | IAC | CddZ | ISAC | CatCg
Do not print unnecessarily. Take care of the environment!
        
    

 

