[CWB] Error cwb-align-import

Jose Manuel Martinez Martinez jmmtra at gmail.com
Tue Jun 23 18:24:02 CEST 2015


Dear all,

I've managed to import the alignment of two corpora at sentence level. I 
would be happy to document the process for the encoding tutorial.

However, I came across an error when trying to align structural 
attributes in a different corpus.

 > sh add_difficulties_align_test.sh
Generating keys for grid regions:
   - TDC-AD-TEST ..... ok
   - TDC-TT-TEST ..... ok
Processing .Error: alignment bead #4 is non-contiguous in TDC-TT-TEST
     (keys: ep1_tr10_dif_3 ep1_tr10_dif_4)

You can find attached a test data set to reproduce the issue. My 
question is, is there a way to overcome this error?

This alignment is essentially a kind of "word alignment". However, I am 
not aligning all words, only those words in the source text that are 
contained within a structural attribute, and I align them only with the 
structural attribute(s) containing their translation. Sometimes, depending 
on the source text unit, the translation is rendered non-contiguously. 
See the example below, especially difficulty id="ep1_tr10_dif_3" in the 
source text and its translation (difficulty id="ep1_tr10_dif_3" and 
difficulty id="ep1_tr10_dif_4").

#-- source

the
<difficulty id="ep1_tr10_dif_2" type="unspec">
interbank
market
</difficulty>
is
<difficulty id="ep1_tr10_dif_3" type="unspec">
restarted
</difficulty>
.

#-- translation

el
<difficulty id="ep1_tr10_dif_2" type="unspec">
mercado
interbancario
</difficulty>
<difficulty id="ep1_tr10_dif_3" type="unspec">
vuelva
a
poner
</difficulty>
se
<difficulty id="ep1_tr10_dif_4" type="unspec">
en
marcha
</difficulty>
.

#-- alignment

ep1_tr10_dif_2    ep1_tr10_dif_2
ep1_tr10_dif_3    ep1_tr10_dif_3 ep1_tr10_dif_4
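
To make the failing condition concrete, here is a minimal Python sketch 
that finds which beads would be rejected. It is not how cwb-align-import 
works internally; it assumes that "contiguous" means no token lies between 
the regions of a bead, which is consistent with the error above. The 
function names region_spans and is_contiguous are hypothetical.

```python
import re

def region_spans(vrt_lines, tag="difficulty"):
    """Map each region id to its (start, end) token positions, inclusive."""
    spans, open_id, start, pos = {}, None, 0, 0
    open_re = re.compile(r'<%s\s[^>]*\bid="([^"]+)"' % re.escape(tag))
    close = "</%s>" % tag
    for raw in vrt_lines:
        line = raw.strip()
        if not line:
            continue
        m = open_re.match(line)
        if m:
            open_id, start = m.group(1), pos
        elif line == close:
            spans[open_id] = (start, pos - 1)
        elif not (line.startswith("<") and line.endswith(">")):
            pos += 1  # every non-tag line is one token
    return spans

def is_contiguous(keys, spans):
    """True if no token lies between consecutive regions of the bead."""
    ordered = sorted(spans[k] for k in keys)
    return all(nxt[0] == prev[1] + 1 for prev, nxt in zip(ordered, ordered[1:]))

# Translation side of the example above.
target_vrt = """el
<difficulty id="ep1_tr10_dif_2" type="unspec">
mercado
interbancario
</difficulty>
<difficulty id="ep1_tr10_dif_3" type="unspec">
vuelva
a
poner
</difficulty>
se
<difficulty id="ep1_tr10_dif_4" type="unspec">
en
marcha
</difficulty>
.
""".splitlines()

spans = region_spans(target_vrt)
for keys in (["ep1_tr10_dif_2"], ["ep1_tr10_dif_3", "ep1_tr10_dif_4"]):
    print(keys, "contiguous" if is_contiguous(keys, spans) else "NON-contiguous")
```

The second bead is flagged because "se" sits between the two regions.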

I also tried wrapping each word in an XML element, like this:

<token id="ep1_tr10_t_2">
mercado
</token>
<token id="ep1_tr10_t_3">
interbancario
</token>
<token id="ep1_tr10_t_4">
vuelva
</token>
<token id="ep1_tr10_t_5">
a
</token>
<token id="ep1_tr10_t_6">
poner
</token>
<token id="ep1_tr10_t_54">
se
</token>
<token id="ep1_tr10_t_7">
en
</token>
<token id="ep1_tr10_t_8">
marcha
</token>

So it is the tokens involved in the alignment that have to be contiguous 
(not the structural elements). In the example given, this is trivial (one 
token more or less...), but I have other cases where the elements are much 
further apart and I don't want to include all the tokens in between.
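
As a rough illustration of that cost, the following Python sketch lists 
the tokens that would be pulled into a bead if it were widened to one 
contiguous range. This is only a way to measure the problem, not a 
recommended fix; the token positions are counted by hand from the example 
above and the function names are hypothetical.

```python
def covering_span(spans):
    """Smallest contiguous (start, end) range containing all given spans."""
    return min(s for s, _ in spans), max(e for _, e in spans)

def gap_positions(spans):
    """Token positions inside the covering span that belong to no region."""
    start, end = covering_span(spans)
    covered = {p for s, e in spans for p in range(s, e + 1)}
    return [p for p in range(start, end + 1) if p not in covered]

# Translation tokens from the example, with 0-based positions.
tokens = ["el", "mercado", "interbancario", "vuelva", "a", "poner",
          "se", "en", "marcha", "."]
# Regions ep1_tr10_dif_3 and ep1_tr10_dif_4 as (start, end) positions.
bead = [(3, 5), (7, 8)]

extra = [tokens[p] for p in gap_positions(bead)]
print(covering_span(bead), extra)
```

Here only "se" would be added, but the list grows with the distance 
between the regions.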

Although my case is a bit special, I don't think this is an infrequent 
scenario; see Amoia et al. 2011 (http://www.aclweb.org/anthology/W11-4302).

Any comments or hints will be much appreciated.

Cheers,

jmm
-------------- next part --------------
ep1_tr10_dif_0	ep1_tr10_dif_0
ep1_tr10_dif_1	ep1_tr10_dif_1
ep1_tr10_dif_2	ep1_tr10_dif_2
ep1_tr10_dif_3	ep1_tr10_dif_3 ep1_tr10_dif_4
ep1_tr10_dif_4	ep1_tr10_dif_5
ep1_tr10_dif_5	ep1_tr10_dif_6
ep1_tr10_dif_6	ep1_tr10_dif_7
ep1_tr10_dif_7	ep1_tr10_dif_8
ep1_tr10_dif_8	ep1_tr10_dif_9 ep1_tr10_dif_10
-------------- next part --------------
Preparation
of
the
<difficulty id="ep1_tr10_dif_0" type="unspec">
European
Council
</difficulty>
,
<difficulty id="ep1_tr10_dif_1" type="unspec">
including
</difficulty>
the
situation
of
the
global
financial
system
(
continuation
of
debate
)
Mr
President
,
I
would
say
to
you
,
and
to
Mr
Jouyet
and
Mr
Almunia
,
that
it
is
absolutely
essential
that
the
<difficulty id="ep1_tr10_dif_2" type="unspec">
interbank
market
</difficulty>
is
<difficulty id="ep1_tr10_dif_3" type="unspec">
restarted
</difficulty>
.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ep1.tr10.ad.vrt
Type: text/xml
Size: 11859 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0002.xml>
-------------- next part --------------
Preparación
de
el
<difficulty id="ep1_tr10_dif_0" type="unspec">
Consejo
Europeo
</difficulty>
,
<difficulty id="ep1_tr10_dif_1" type="unspec">
incluyendo
</difficulty>
la
situación
de
el
sistema
financiero
mundial
(
continuación
de
el
debate
)
.
Sr.
Presidente
le
diré
a
usted
,
a
el
Sr.
Jouyet
y
a
el
Sr.
Almunia
,
que
es
absolutamente
necesario
que
el
<difficulty id="ep1_tr10_dif_2" type="unspec">
mercado
interbancario
</difficulty>
<difficulty id="ep1_tr10_dif_3" type="unspec">
vuelva
a
poner
</difficulty>
se
<difficulty id="ep1_tr10_dif_4" type="unspec">
en
marcha
</difficulty>
.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ep1.tr10.tt.vrt
Type: text/xml
Size: 22760 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0003.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.sh
Type: application/x-sh
Size: 1903 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0001.sh>

