[CWB] Using corpora alignment feature of CWB

Wed Jul 21 15:01:18 CEST 2010

Dear all,

I'm trying to use the corpus alignment feature of CWB.

I've built my source files (tmxfr.wtc and tmxen.wtc) and run
cwb-encode on them.
Then I edited each registry file, adding an 'ALIGNED <theothercorpus>'
line after the declaration of the <seg> structural attribute.
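
A quick way to confirm that both registry files really carry the declaration is sketched below. The helper name has_aligned is hypothetical (it is not part of CWB); it only greps for an uncommented ALIGNED line and says nothing about whether alignment data exists.

```shell
#!/bin/sh
# has_aligned REGISTRY_FILE TARGET_ID   (hypothetical helper, not a CWB tool)
# Succeeds if REGISTRY_FILE contains an uncommented "ALIGNED <TARGET_ID>" line.
has_aligned() {
    grep -Eq "^[[:space:]]*ALIGNED[[:space:]]+$2([[:space:]]|#|\$)" "$1"
}

# Example (paths taken from the commands in this message):
#   has_aligned /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr tmxen && echo ok
#   has_aligned /home/mdecorde/TXM/corpora/tmxtest/registry/tmxen tmxfr && echo ok
```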

The following CQP script:
TMXFR;
"séance" :TMXEN "meeting";
returns:
"0 match"
but we know we should get one match.
Am I missing something, or doing something wrong?
Thanks for any reply.
Best,

Matthieu

An archive of the files I used is available at:
http://mercure.ens-lsh.fr/get?k=4KVeQoloWRT5FaO8hOF

=======
The commands I used

rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/fr/*
rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/en/*

/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
    -d /home/mdecorde/TXM/corpora/tmxtest/data/fr \
    -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxfr.wtc \
    -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr \
    -c utf8 -xsB -xsB \
    -P pos -P lemma -P id \
    -S text:0+base+project+id \
    -S tu:0+tuid+committee+vote+lead+session \
    -S seg:0+id

/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
    -d /home/mdecorde/TXM/corpora/tmxtest/data/en \
    -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxen.wtc \
    -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxen \
    -c utf8 -xsB -xsB \
    -P pos -P lemma -P id \
    -S text:0+base+project+id \
    -S tu:0+tuid+committee+vote+lead+session \
    -S seg:0+id

/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall \
    -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxfr
/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall \
    -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxen
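
The cwb-encode and cwb-makeall calls above build the p- and s-attributes only. An ALIGNED declaration normally also needs encoded alignment data, which the separate cwb-align and cwb-align-encode tools produce. The sketch below checks for that data, assuming it is stored as a non-empty <target>.alx file in the source corpus data directory. The file name and the helper name has_alignment_data are assumptions worth verifying against the CWB version in use.

```shell
#!/bin/sh
# has_alignment_data DATA_DIR TARGET_ID   (hypothetical helper, not a CWB tool)
# Succeeds if DATA_DIR holds a non-empty alignment file for TARGET_ID.
# Assumption: the encoded alignment is named "<TARGET_ID>.alx".
has_alignment_data() {
    [ -s "$1/$2.alx" ]
}

# Example (paths taken from the commands in this message):
#   has_alignment_data /home/mdecorde/TXM/corpora/tmxtest/data/fr tmxen || echo "no alignment data"
#   has_alignment_data /home/mdecorde/TXM/corpora/tmxtest/data/en tmxfr || echo "no alignment data"
```

If the check fails for a corpus, the "0 match" result would be consistent with CQP having a declaration but no alignment data to follow.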

===============================

Registry file for TMXFR:

##
## registry entry for corpus TMXFR
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   tmxfr
# path to binary data files
HOME /home/mdecorde/TXM/corpora/tmxtest/data/fr
# optional info file (displayed by "info;" command in CQP)
INFO /home/mdecorde/TXM/corpora/tmxtest/data/fr/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "??"     # insert ISO code for language (de, en, fr, ...)

##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE pos
ATTRIBUTE lemma
ATTRIBUTE id

##
## s-attributes (structural markup)
##

# <text base=".." project=".." id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_base            # [annotations]
STRUCTURE text_project         # [annotations]
STRUCTURE text_id              # [annotations]

# <tu tuid=".." committee=".." vote=".." lead=".." session=".."> ... </tu>
# (no recursive embedding allowed)
STRUCTURE tu
STRUCTURE tu_tuid              # [annotations]
STRUCTURE tu_committee         # [annotations]
STRUCTURE tu_vote              # [annotations]
STRUCTURE tu_lead              # [annotations]
STRUCTURE tu_session           # [annotations]

# <seg id=".."> ... </seg>
# (no recursive embedding allowed)
STRUCTURE seg
STRUCTURE seg_id               # [annotations]

ALIGNED tmxen

# Yours sincerely, the Encode tool.

=============================