[CWB] Using corpora alignment feature of CWB
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu Jul 22 07:12:45 CEST 2010
Hi Matthieu,
Adding ALIGNED to the registry isn't enough. I think you need to run cwb-align and then cwb-align-encode at some point in order to actually create the alignment attribute, or alternatively use cwb-align-import. Alas, these tools (the first two especially) are largely undocumented; the most extensive material so far is the following:
http://cwb.svn.sourceforge.net/viewvc/cwb/cwb/trunk/man/cwb-align.pod
(aka man cwb-align)
http://cwb.svn.sourceforge.net/viewvc/cwb/cwb/trunk/man/cwb-align-encode.pod
(aka man cwb-align-encode)
http://cwb.svn.sourceforge.net/viewvc/cwb/perl/trunk/CWB/script/cwb-align-import
from line 272 (aka man cwb-align-import)
http://liste.sslmit.unibo.it/pipermail/cwb/2007-February/000064.html
(An old email from Stefan which explains some of the steps in aligning)
I think you'll need something like
cwb-align -S seg tmxfr tmxen seg
or
cwb-align -V seg tmxfr tmxen seg
(the former if corresponding seg elements are matched by their order in the two corpora; the latter if they are matched by identical attribute values).
and then something like
cwb-align-encode out.align
... with extra options (e.g. -r for the registry directory) on some or all of these commands as necessary!
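Put together, the two steps above might look like the following dry-run sketch. It is not a tested recipe: the flags (-r, -o, -V, -D) are as I remember them from the man pages and should be checked against `cwb-align -h` and `cwb-align-encode -h`. Using seg_id with -V is my assumption that -V wants the s-attribute carrying the id values. The `run` and `align_pair` helpers, the DRY_RUN switch and the output file names are made up for illustration. The registry path is taken from Matthieu's message.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN=0 is set.
# Assumption: flags are as described in the cwb-align and
# cwb-align-encode man pages; verify with -h before running for real.
REGISTRY="${REGISTRY:-/home/mdecorde/TXM/corpora/tmxtest/registry}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: align SOURCE to TARGET on <seg> regions, then
# encode the result as an alignment attribute of SOURCE.
align_pair() {
    src="$1"
    tgt="$2"
    out="${src}_${tgt}.align"
    # -V seg_id: pre-align regions with identical id values (assumption)
    run cwb-align -r "$REGISTRY" -o "$out" -V seg_id "$src" "$tgt" seg
    # -D: write the attribute into the source corpus data directory
    run cwb-align-encode -r "$REGISTRY" -D "$out"
}

align_pair tmxfr tmxen   # needed for queries starting from TMXFR
align_pair tmxen tmxfr   # needed for queries starting from TMXEN
```

Each direction is aligned separately here because each registry file declares its own ALIGNED line. If only the TMXFR-to-TMXEN query matters, the first call should be enough.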
Hope this helps -- Stefan may have more to add on this issue.
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Matthieu Decorde
Sent: 21 July 2010 14:01
To: cwb at sslmit.unibo.it
Subject: [CWB] Using corpora alignment feature of CWB
Dear all,
I'm trying to use the corpora alignment feature of CWB.
I've built my source files (tmxfr.wtc and tmxen.wtc), and called
cwb-encode on them.
Then I edited both registry files, adding 'ALIGNED <theothercorpus>' to each one
in the declaration section of the <seg> structural attribute.
The following CQP script:
TMXFR;
"séance" :TMXEN "meeting";
shows:
"0 match"
But we know we should get one match.
Am I missing something or doing something wrong?
Thanks for any reply.
Best,
Matthieu
An archive of the files I used is at:
http://mercure.ens-lsh.fr/get?k=4KVeQoloWRT5FaO8hOF
=======
The commands I used
rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/fr/*
rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/en/*
/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
  -d /home/mdecorde/TXM/corpora/tmxtest/data/fr \
  -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxfr.wtc \
  -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr \
  -c utf8 -xsB -xsB -P pos -P lemma -P id \
  -S text:0+base+project+id \
  -S tu:0+tuid+committee+vote+lead+session \
  -S seg:0+id
/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
  -d /home/mdecorde/TXM/corpora/tmxtest/data/en \
  -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxen.wtc \
  -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxen \
  -c utf8 -xsB -xsB -P pos -P lemma -P id \
  -S text:0+base+project+id \
  -S tu:0+tuid+committee+vote+lead+session \
  -S seg:0+id
/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxfr
/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxen
===============================
Registry file:
##
## registry entry for corpus TMXFR
##
# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID tmxfr
# path to binary data files
HOME /home/mdecorde/TXM/corpora/tmxtest/data/fr
# optional info file (displayed by "info;" command in CQP)
INFO /home/mdecorde/TXM/corpora/tmxtest/data/fr/.info
# corpus properties provide additional information about the corpus:
##:: charset = "utf8" # character encoding of corpus data
##:: language = "??" # insert ISO code for language (de, en, fr, ...)
##
## p-attributes (token annotations)
##
ATTRIBUTE word
ATTRIBUTE pos
ATTRIBUTE lemma
ATTRIBUTE id
##
## s-attributes (structural markup)
##
# <text base=".." project=".." id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_base # [annotations]
STRUCTURE text_project # [annotations]
STRUCTURE text_id # [annotations]
# <tu tuid=".." committee=".." vote=".." lead=".." session=".."> ... </tu>
# (no recursive embedding allowed)
STRUCTURE tu
STRUCTURE tu_tuid # [annotations]
STRUCTURE tu_committee # [annotations]
STRUCTURE tu_vote # [annotations]
STRUCTURE tu_lead # [annotations]
STRUCTURE tu_session # [annotations]
# <seg id=".."> ... </seg>
# (no recursive embedding allowed)
STRUCTURE seg
STRUCTURE seg_id # [annotations]
ALIGNED tmxen
# Yours sincerely, the Encode tool.
=============================
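A registry file like the one above can declare an alignment that has never been encoded, which would explain a query returning no matches. The sketch below checks both halves. The `check_alignment` function is hypothetical, and the `.alx` file name is my assumption about what cwb-align-encode writes into the data directory in its default (non-compatibility) mode.

```shell
#!/bin/sh
# check_alignment REGISTRY_FILE DATA_DIR TARGET
# Hypothetical helper. Returns 0 if the registry declares "ALIGNED TARGET"
# and the alignment data file exists, 1 if the declaration is missing,
# 2 if it is declared but the data file is absent.
check_alignment() {
    reg_file="$1"
    data_dir="$2"
    target="$3"
    if ! grep -Eq "^ALIGNED[[:space:]]+${target}[[:space:]]*$" "$reg_file"; then
        echo "no ALIGNED $target line in $reg_file"
        return 1
    fi
    # Assumption: the encoded alignment attribute is stored as TARGET.alx
    if [ ! -f "$data_dir/$target.alx" ]; then
        echo "declared but not encoded: $data_dir/$target.alx is missing"
        return 2
    fi
    echo "ok"
}
```

With the paths from this message, the call would be `check_alignment /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr /home/mdecorde/TXM/corpora/tmxtest/data/fr tmxen`.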