[CWB] Error cwb-align-import
Jose Manuel Martinez Martinez
jmmtra at gmail.com
Tue Jun 23 18:24:02 CEST 2015
Dear all,
I've managed to import the alignment of two corpora at sentence level, and
I would be happy to document the process for the encoding tutorial.
However, I came across an error when trying to align structural
attributes in a different corpus.
> sh add_difficulties_align_test.sh
Generating keys for grid regions:
- TDC-AD-TEST ..... ok
- TDC-TT-TEST ..... ok
Processing .Error: alignment bead #4 is non-contiguous in TDC-TT-TEST
(keys: ep1_tr10_dif_3 ep1_tr10_dif_4)
I have attached a test data set that reproduces the issue. My
question is: is there a way to overcome this error?
This alignment is basically a kind of "word alignment". However, I am
not aligning all words, only those words in the source text that are
contained within a structural attribute, and I align them only with the
structural attribute(s) containing the translation. Sometimes, depending
on the source text unit, the translation is a non-contiguous rendering.
See the example below, especially difficulty id="ep1_tr10_dif_3" in the
source text and its translation (difficulty id="ep1_tr10_dif_3" and
difficulty id="ep1_tr10_dif_4").
#-- source
the
<difficulty id="ep1_tr10_dif_2" type="unspec">
interbank
market
</difficulty>
is
<difficulty id="ep1_tr10_dif_3" type="unspec">
restarted
</difficulty>
.
#-- translation
el
<difficulty id="ep1_tr10_dif_2" type="unspec">
mercado
interbancario
</difficulty>
<difficulty id="ep1_tr10_dif_3" type="unspec">
vuelva
a
poner
</difficulty>
se
<difficulty id="ep1_tr10_dif_4" type="unspec">
en
marcha
</difficulty>
.
#-- alignment
ep1_tr10_dif_2 ep1_tr10_dif_2
ep1_tr10_dif_3 ep1_tr10_dif_3 ep1_tr10_dif_4
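To make the problem concrete, here is a minimal Python sketch (not part of CWB) that counts the tokens lying between the regions of one bead. It assumes one token or tag per line, with the id as the first attribute, as in the excerpt above. It also assumes that "non-contiguous" means at least one token separates two regions of the same bead, which matches the error I get but is my reading, not documented behaviour. The names region_spans and gaps_in_bead are hypothetical helpers.

```python
import re


def region_spans(vrt_lines, tag="difficulty"):
    """Map each region id to its (first, last) token position.

    Assumption: one token or one XML tag per line, id is the first attribute.
    """
    open_tag = re.compile(rf'<{re.escape(tag)}\s+id="([^"]+)"')
    close_tag = f"</{tag}>"
    spans, pos, current, start = {}, 0, None, 0
    for raw in vrt_lines:
        line = raw.strip()
        if not line:
            continue
        match = open_tag.match(line)
        if match:
            current, start = match.group(1), pos
        elif line == close_tag:
            if current is not None:
                spans[current] = (start, pos - 1)
                current = None
        elif len(line) > 1 and line.startswith("<") and line.endswith(">"):
            continue  # some other tag: not a token, so do not count it
        else:
            pos += 1  # an ordinary token line
    return spans


def gaps_in_bead(keys, spans):
    """Return (left_key, right_key, tokens_between) for each gap in a bead."""
    ordered = sorted(keys, key=lambda k: spans[k][0])
    gaps = []
    for left, right in zip(ordered, ordered[1:]):
        between = spans[right][0] - spans[left][1] - 1
        if between > 0:
            gaps.append((left, right, between))
    return gaps


# Target-side excerpt from the example above.
target = """el
<difficulty id="ep1_tr10_dif_2" type="unspec">
mercado
interbancario
</difficulty>
<difficulty id="ep1_tr10_dif_3" type="unspec">
vuelva
a
poner
</difficulty>
se
<difficulty id="ep1_tr10_dif_4" type="unspec">
en
marcha
</difficulty>
.""".splitlines()

spans = region_spans(target)
gaps = gaps_in_bead(["ep1_tr10_dif_3", "ep1_tr10_dif_4"], spans)
print(gaps)  # [('ep1_tr10_dif_3', 'ep1_tr10_dif_4', 1)]
```

The single token reported between the two regions is "se", which is exactly what separates ep1_tr10_dif_3 from ep1_tr10_dif_4 on the target side.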
I also tried wrapping each word in an XML element, like this:
<token id="ep1_tr10_t_2">
mercado
</token>
<token id="ep1_tr10_t_3">
interbancario
</token>
<token id="ep1_tr10_t_4">
vuelva
</token>
<token id="ep1_tr10_t_5">
a
</token>
<token id="ep1_tr10_t_6">
poner
</token>
<token id="ep1_tr10_t_54">
se
</token>
<token id="ep1_tr10_t_7">
en
</token>
<token id="ep1_tr10_t_8">
marcha
</token>
So it is the tokens involved in the alignment that have to be contiguous,
not the structural elements. In the example given, this is trivial (one
token more or less...), but I have other cases where the elements are much
farther apart and I don't want to include all the tokens in between.
Although my case is a bit special, I don't think this is an infrequent
scenario; see Amoia et al. 2011, http://www.aclweb.org/anthology/W11-4302.
Any comments or hints will be much appreciated.
Cheers,
jmm
------------ next part ------------
An HTML attachment was scrubbed...
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0001.html>
------------ next part ------------
ep1_tr10_dif_0 ep1_tr10_dif_0
ep1_tr10_dif_1 ep1_tr10_dif_1
ep1_tr10_dif_2 ep1_tr10_dif_2
ep1_tr10_dif_3 ep1_tr10_dif_3 ep1_tr10_dif_4
ep1_tr10_dif_4 ep1_tr10_dif_5
ep1_tr10_dif_5 ep1_tr10_dif_6
ep1_tr10_dif_6 ep1_tr10_dif_7
ep1_tr10_dif_7 ep1_tr10_dif_8
ep1_tr10_dif_8 ep1_tr10_dif_9 ep1_tr10_dif_10
------------ next part ------------
Preparation
of
the
<difficulty id="ep1_tr10_dif_0" type="unspec">
European
Council
</difficulty>
,
<difficulty id="ep1_tr10_dif_1" type="unspec">
including
</difficulty>
the
situation
of
the
global
financial
system
(
continuation
of
debate
)
Mr
President
,
I
would
say
to
you
,
and
to
Mr
Jouyet
and
Mr
Almunia
,
that
it
is
absolutely
essential
that
the
<difficulty id="ep1_tr10_dif_2" type="unspec">
interbank
market
</difficulty>
is
<difficulty id="ep1_tr10_dif_3" type="unspec">
restarted
</difficulty>
.
------------ next part ------------
A non-text attachment was scrubbed...
Name: ep1.tr10.ad.vrt
Type: text/xml
Size: 11859 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0002.xml>
------------ next part ------------
Preparación
de
el
<difficulty id="ep1_tr10_dif_0" type="unspec">
Consejo
Europeo
</difficulty>
,
<difficulty id="ep1_tr10_dif_1" type="unspec">
incluyendo
</difficulty>
la
situación
de
el
sistema
financiero
mundial
(
continuación
de
el
debate
)
.
Sr.
Presidente
le
diré
a
usted
,
a
el
Sr.
Jouyet
y
a
el
Sr.
Almunia
,
que
es
absolutamente
necesario
que
el
<difficulty id="ep1_tr10_dif_2" type="unspec">
mercado
interbancario
</difficulty>
<difficulty id="ep1_tr10_dif_3" type="unspec">
vuelva
a
poner
</difficulty>
se
<difficulty id="ep1_tr10_dif_4" type="unspec">
en
marcha
</difficulty>
.
------------ next part ------------
A non-text attachment was scrubbed...
Name: ep1.tr10.tt.vrt
Type: text/xml
Size: 22760 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0003.xml>
------------ next part ------------
A non-text attachment was scrubbed...
Name: test.sh
Type: application/x-sh
Size: 1903 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150623/d8234080/attachment-0001.sh>