[CWB] A question about the aligning using cwb-encoding
Andres Chandia
andres at chandia.net
Thu Feb 20 18:34:15 CET 2014
Content of the tiny corpus:
We want to index parallel corpora.
Queremos indexar corpus paralelos.
This is a test of parallel corpora indexing in CQP.
Esta es una prueba de indexación de corpus paralelos en CQP.
I write down here a longer sentence to test this method,
Escribo aquí una frase más larga para probar este método,
it seems to me that this would work.
me parece que esto funcionaría.
On Thu, February 20, 2014 18:24, Josep M. Fontana wrote:
Andrés, I don't see it working. Unless it is so tiny, so tiny that you haven't put in the words I tried to search for. I entered "the man" or simply "the" and it finds nothing.
JM
Dear Ray Wu,
I did it the way you suggested; it is easy and clear. Here is my test parallel corpus: parallel tiny test corpus (English-Spanish)
user and password: guest
On Sat, February 15, 2014 10:05, Ray Wu wrote:
Hi all,
Andrew is right. We made no modification to the code and simply used the
translation-visualisation feature. It can be achieved like this:
Step 1: Prepare a CQPweb-compatible corpus file "test.txt" (in UTF-8 format):
<s trans="The original language">
The
translated
text
.
</s>
Step 2: When installing a new corpus, configure it by specifying the following under "S-attributes (XML elements) -> Use custom setup":
0+trans
(NB: Specify "P-attributes" as necessary if your corpus is different from mine.)
Step 3: When everything is done, go to "Manage visualisations -> Free translation -> Select XML element/attribute to get the translation from" and choose "s_trans" to provide whole-sentence translation.
Although it works, it certainly lacks some features provided by cwb-align; for instance, it doesn't support the alignment of more than two languages. We are still looking for ways to address this issue.
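The corpus file from Step 1 could be generated from a list of sentence pairs with a short script. This is only a minimal sketch: it assumes the element is "s" with an attribute "trans" (as Steps 2 and 3 suggest), that tokens are separated by whitespace, and that escaping special characters in the attribute value is acceptable to the encoder. The function name to_vertical is made up for illustration.

```python
from xml.sax.saxutils import escape


def to_vertical(pairs):
    """Return vertical-format text: one token per line, each sentence wrapped
    in <s trans="..."> holding the whole-sentence translation.

    Assumptions (not from the thread): whitespace tokenisation, and XML
    escaping of &, <, > and double quotes inside the attribute value.
    """
    lines = []
    for source, translation in pairs:
        value = escape(translation, {'"': "&quot;"})
        lines.append('<s trans="%s">' % value)
        lines.extend(source.split())  # naive tokenisation, one token per line
        lines.append("</s>")
    return "\n".join(lines) + "\n"


pairs = [
    ("We want to index parallel corpora .", "Queremos indexar corpus paralelos."),
    ("it seems to me that this would work .", "me parece que esto funcionaría."),
]
text = to_vertical(pairs)
print(text)
```

To produce the actual file, the returned text can be written with open("test.txt", "w", encoding="utf-8"). Whether the encoder decodes the escaped entities again should be checked against your own installation.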
Best,
Ray
At 2014-02-14 04:41:09,"Hardie, Andrew" wrote:
It looks to me like they are using the
translation-visualisation feature. This is really designed for interlinear field
data, where you would have the original language as the word p-attribute,
the morpheme gloss as the primary annotation p-attribute, and the free
translation as an annotated s-attribute. However, I built it in such a way that
you can turn on translations without glossing. I think that's what they've
done, putting one corpus into the XML of the other. No reason why others
shouldn't be able to use the same trick.
Worth noting once again that I never actually finished work on the advanced
visualisations.
Best
Andrew.
"Josep M. Fontana" josepm.fontana at upf.edu wrote:
>>> Is it possible right now to use the CQPweb interface to exploit parallel corpora?
>>> The question is: is the future here already?
> No.
>
> This is still planned, but I have not had time to do it yet.
OK, so this means that the people who did this had to do quite a bit of hacking:
http://124.193.83.252/cqp/
If you notice, at the end there are a few parallel corpora. Access is now restricted, but I had been able to access it and it really seemed to work well.
JM
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> Sent: 13 February 2014 17:11
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] A question about the aligning using cwb-encoding
>
> I just found this old thread on alignment and this reminded me of something that I had wanted to ask for a while. Is it possible right now to use the CQPweb interface to exploit parallel corpora? We have parallel corpora from translations between different languages (so the alignment is already done) but these are using a very problematic and proprietary interface. We would like to move all of our corpora to the best web interface there is, CQPweb, of course :-)
>
> I found a paper written by Andrew (http://www.lancaster.ac.uk/people/hardiea/cqpweb-paper.pdf) where he talks about using CQPweb with parallel corpora but as something he was planning for the future: "Other planned extensions remain to be implemented: support for concordancing across parallel corpora;".
>
> The question is: is the future here already?
>
> JM
>>> The first few sentences were aligned as correct pairs.
>>> But the others were not.
>>> It seems to be related to the statistical alignment process.
>> You're absolutely right. cwb-align isn't a particularly sophisticated sentence aligner, so it's likely to get some cases wrong. You may be seeing particularly bad performance if you're using the default parameter settings, which are intended for related languages and are based on sentence length (in characters), character n-gram counts and identical words.
>>
>> For Korean-English alignment, the best solution might be to get a good bilingual word list and use that as the only feature (dropping even sentence length).
>>
>>> Actually, I made the two corpora so that every sentence pair has the same sentence id, like or , in order to avoid the failure of statistical alignment.
>>> I am working with 60000 sentences. I manually aligned all sentences and put the information into the XML tag "s_id".
>>>
>>> My question is: how can I make use of the manually created XML tag "s_id"?
>> If these are only 1:1 alignments, you can use a trick to smuggle them past cwb-align:
>>
>> cwb-align -V s_id -o alignment.txt CORPUS1 CORPUS2 s -C:1
>>
>> With "-V s_id", the manually aligned sentence pairs are taken as a pre-alignment, and the statistical aligner is only run within each pair of pre-aligned regions. Since each of those contains just a single sentence pair, it cannot further break up the bead, so the original pre-alignment is passed through. Feature specs shouldn't matter here, so you might as well just specify -C:1 to avoid unnecessary overhead. You can then proceed to cwb-align-encode the generated file alignment.txt as usual.
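Before relying on the "-V s_id" trick, it may be worth confirming that the manual alignment really is 1:1. The sketch below assumes the IDs appear in the vertical source files as <s id="..."> tags (which would give the attribute s_id after encoding) and that both corpora list the sentences in the same order. The helper names sentence_ids and check_one_to_one are made up for illustration and are not part of CWB.

```python
import re

# Matches an opening <s ...> tag that carries an id="..." attribute.
S_ID = re.compile(r'<s\s[^>]*\bid="([^"]*)"')


def sentence_ids(vertical_text):
    """Collect the id values of all <s id="..."> tags, in document order."""
    ids = []
    for line in vertical_text.splitlines():
        match = S_ID.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


def check_one_to_one(src_ids, tgt_ids):
    """Return a list of problems; an empty list means the IDs pair up 1:1."""
    problems = []
    if len(set(src_ids)) != len(src_ids):
        problems.append("duplicate ids in source corpus")
    if len(set(tgt_ids)) != len(tgt_ids):
        problems.append("duplicate ids in target corpus")
    if src_ids != tgt_ids:
        problems.append("id sequences differ between source and target")
    return problems


source = '<s id="1">\nHello\n.\n</s>\n<s id="2">\nGoodbye\n.\n</s>\n'
target = '<s id="1">\nHola\n.\n</s>\n<s id="2">\nAdiós\n.\n</s>\n'
print(check_one_to_one(sentence_ids(source), sentence_ids(target)))
```

If the check reports no problems, each pre-aligned region holds exactly one sentence on each side, which is the situation the cwb-align command above relies on.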
>>
>> If you have more complex alignments (n:1 or 1:n, 2:2, ...), you could add new XML regions, e.g.
>>
>> ...
>>
>> and use -V bead_id for the pre-alignment in cwb-align.
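One way to add such regions is to wrap each group of sentences that belongs to one alignment bead in an extra XML element when writing the vertical file. The sketch below is an assumption, not the example from the message above: the element name "bead" with an attribute "id" is inferred only from the attribute name bead_id, and the function wrap_beads is hypothetical. Bead IDs are assumed to be simple strings that need no escaping.

```python
def wrap_beads(beads):
    """Return vertical-format text in which every bead is one <bead> region.

    `beads` is a list of (bead_id, sentences) tuples; `sentences` is a list
    of token lists. Element and attribute names are assumptions.
    """
    lines = []
    for bead_id, sentences in beads:
        lines.append('<bead id="%s">' % bead_id)
        for tokens in sentences:
            lines.append("<s>")
            lines.extend(tokens)  # one token per line
            lines.append("</s>")
        lines.append("</bead>")
    return "\n".join(lines) + "\n"


# A 2:1 bead: two source sentences correspond to one target sentence.
source = wrap_beads([("b1", [["It", "rains", "."], ["We", "stay", "."]])])
target = wrap_beads([("b1", [["Llueve", "y", "nos", "quedamos", "."]])])
print(source)
print(target)
```

Both corpora use the same bead IDs, so that regions with matching values can serve as the pre-alignment.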
>>
>>
>> If you have a recent version of the CWB/Perl interface, the best strategy is to use the cwb-align-import tool. You'll have to provide a separate alignment file that lists the sentence IDs in source and target corpus for each alignment bead. Complex alignments require no special treatment with this tool. See "perldoc cwb-align-import" for usage and format details.
>>
>>
>> Best,
>> Stefan
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
_______________________
andrés chandía
administrator of parles.upf.edu
psicoaching.net
mapuche koyaktu
ong mapuche koyaktu
Do not print unnecessarily. Take care of the environment!