[CWB] Error cwb-align-import

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Wed Jun 24 09:38:14 CEST 2015


Hi,
I don't know whether this helps, but I use positional attributes to
encode word alignment, and then transform the query output so that the
alignment is displayed. Essentially, you can put anything into these
positional attributes, including ranges, contiguous or not. The
challenge then simply shifts to transforming the output.
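
As a minimal sketch of this idea (not the encoding actually used in
ParaSol or paravoz2), the Python below writes one token per line with a
second, tab-separated column that holds the aligned token positions in
the other language. The range syntax ("3-5,7-8", "_" for no link), the
function names and the 0-based positions are assumptions made for
illustration only; the sample tokens come from the example quoted
further down.

```python
def format_positions(positions):
    """Render positions as ranges, e.g. [3, 4, 5, 7, 8] -> '3-5,7-8'.

    An empty list is rendered as '_' (hypothetical "no alignment" marker).
    """
    if not positions:
        return "_"
    positions = sorted(set(positions))
    runs = []
    start = prev = positions[0]
    for p in positions[1:]:
        if p == prev + 1:          # still inside the current run
            prev = p
            continue
        runs.append((start, prev))  # close the run, open a new one
        start = prev = p
    runs.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def to_vertical(tokens, links):
    """One token per line; second column = aligned positions in the other text."""
    return "\n".join(
        f"{word}\t{format_positions(links.get(i, []))}"
        for i, word in enumerate(tokens)
    )


# Source sentence; links map source token index -> target token indices
# (0-based, counted by hand against the Spanish translation).
source = ["the", "interbank", "market", "is", "restarted", "."]
links = {1: [2], 2: [1], 4: [3, 4, 5, 7, 8]}

print(to_vertical(source, links))
```

The second column could then be declared as an additional positional
attribute when the corpus is encoded, so a non-contiguous rendering
such as "vuelva a poner ... en marcha" is just an attribute value.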

My solution is to have CWB output the results as XML, with the word
alignment carried in the positional attributes, and to transform that
using XSLT. Have a look here: http://www.parasolcorpus.org/KrakowMW/
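
The transformation step can be illustrated roughly as follows. This
sketch swaps in plain Python for the XSLT stylesheet and works on an
already extracted attribute value rather than on CWB's XML output, so
it only shows the decoding logic. The value syntax ("3-5,7-8" for
0-based token positions, "_" for no link) and the function names are
assumptions, not the convention used by the interface.

```python
def parse_positions(spec):
    """Expand '3-5,7-8' to [3, 4, 5, 7, 8]; '_' means no alignment."""
    if spec == "_":
        return []
    positions = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            positions.extend(range(int(first), int(last) + 1))
        else:
            positions.append(int(part))
    return positions


def highlight(target_tokens, spec):
    """Mark the aligned target tokens with square brackets."""
    aligned = set(parse_positions(spec))
    return " ".join(
        f"[{word}]" if i in aligned else word
        for i, word in enumerate(target_tokens)
    )


target = ["el", "mercado", "interbancario", "vuelva", "a", "poner",
          "se", "en", "marcha", "."]

# Attribute value assumed to be stored on the source token "restarted".
print(highlight(target, "3-5,7-8"))
```

Tokens between the two ranges ("se" here) stay unmarked, which is the
point of storing ranges instead of one contiguous region.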

The interface is open source (https://bitbucket.org/rvwfels/paravoz2),
but we just found a bug which has not been fixed yet, so write to me
for details if you want to try it out (essentially, you need to follow
a certain naming convention when encoding the corpus).

Best!
Ruprecht


On 23.06.2015 at 18:24, Jose Manuel Martinez Martinez wrote:
> Dear all,
>
> I've managed to import the alignment of two corpora at sentence level.
> I would be happy to document the process for the encoding tutorial.
>
> However, I came across an error when trying to align
> structural attributes in a different corpus.
>
> > sh add_difficulties_align_test.sh
> Generating keys for grid regions:
>   - TDC-AD-TEST ..... ok
>   - TDC-TT-TEST ..... ok
> Processing .Error: alignment bead #4 is non-contiguous in TDC-TT-TEST
>     (keys: ep1_tr10_dif_3 ep1_tr10_dif_4)
>
> You can find attached a test data set to reproduce the issue. My
> question is: is there a way to overcome this error?
>
> This alignment is basically a kind of "word alignment"; however, I
> am not aligning all words, but only those words in the source text
> contained within a structural attribute, and I align them only with
> the structural attribute(s) containing the translation. Sometimes,
> depending on the source text unit, the translation is a non-contiguous
> rendering. See the example below, especially difficulty
> id="ep1_tr10_dif_3" in the source text and its translation (difficulty
> id="ep1_tr10_dif_3" and difficulty id="ep1_tr10_dif_4").
>
> #-- source
>
> the
> <difficulty id="ep1_tr10_dif_2" type="unspec">
> interbank
> market
> </difficulty>
> is
> <difficulty id="ep1_tr10_dif_3" type="unspec">
> restarted
> </difficulty>
> .
>
> #-- translation
>
> el
> <difficulty id="ep1_tr10_dif_2" type="unspec">
> mercado
> interbancario
> </difficulty>
> <difficulty id="ep1_tr10_dif_3" type="unspec">
> vuelva
> a
> poner
> </difficulty>
> se
> <difficulty id="ep1_tr10_dif_4" type="unspec">
> en
> marcha
> </difficulty>
> .
>
> #-- alignment
>
> ep1_tr10_dif_2    ep1_tr10_dif_2
> ep1_tr10_dif_3    ep1_tr10_dif_3 ep1_tr10_dif_4
>
> I also tried to wrap each word in an XML element like:
>
> <token id="ep1_tr10_t_2">
> mercado
> </token>
> <token id="ep1_tr10_t_3">
> interbancario
> </token>
> <token id="ep1_tr10_t_4">
> vuelva
> </token>
> <token id="ep1_tr10_t_5">
> a
> </token>
> <token id="ep1_tr10_t_6">
> poner
> </token>
> <token id="ep1_tr10_t_54">
> se
> </token>
> <token id="ep1_tr10_t_7">
> en
> </token>
> <token id="ep1_tr10_t_8">
> marcha
> </token>
>
> So the tokens involved in the alignment have to be contiguous (not the
> structural elements). In the example given, this is trivial (one token
> more or less...), but I have other cases where elements appear much
> farther apart and I don't want to include all the tokens in between.
>
> Although my case is a bit special, I don't think this is an infrequent
> scenario; see Amoia et al. 2011, http://www.aclweb.org/anthology/W11-4302.
>
> Any comments or hints would be much appreciated.
>
> Cheers,
>
> jmm
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
