[CWB] Using corpora alignment feature of CWB
Matthieu Decorde
matthieu.decorde at ens-lyon.fr
Wed Jul 21 15:01:18 CEST 2010
Dear all,
I'm trying to use the corpus alignment feature of CWB.
I built my source files (tmxfr.wtc and tmxen.wtc) and ran
cwb-encode on them.
Then I edited both registry files, adding 'ALIGNED <theothercorpus>' to
each of them,
in the section that declares the <seg> structural attribute.
The following CQP script:
TMXFR;
"séance" :TMXEN "meeting";
reports:
"0 match"
but we know there should be one match.
Am I missing something, or doing something wrong?
Thanks for any reply.
Best,
Matthieu
An archive of the files I used is at:
http://mercure.ens-lsh.fr/get?k=4KVeQoloWRT5FaO8hOF
=======
The commands I used:
rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/fr/*
rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/en/*
/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
    -d /home/mdecorde/TXM/corpora/tmxtest/data/fr \
    -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxfr.wtc \
    -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr \
    -c utf8 -xsB -xsB \
    -P pos -P lemma -P id \
    -S text:0+base+project+id \
    -S tu:0+tuid+committee+vote+lead+session \
    -S seg:0+id
/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
    -d /home/mdecorde/TXM/corpora/tmxtest/data/en \
    -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxen.wtc \
    -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxen \
    -c utf8 -xsB -xsB \
    -P pos -P lemma -P id \
    -S text:0+base+project+id \
    -S tu:0+tuid+committee+vote+lead+session \
    -S seg:0+id
/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall \
    -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxfr
/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall \
    -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxen
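For reference, a minimal sketch of the alignment step as it is usually run in CWB, which the commands above do not include. It assumes that cwb-align and cwb-align-encode are installed next to cwb-encode, that <seg> is the alignment grid, and that the alignment is wanted in both directions. The output paths under /tmp and the helper name print_align_step are hypothetical. The script only prints the commands, so nothing is modified:

```shell
#!/bin/sh
# Dry run: print the alignment commands instead of executing them.
# Assumptions (not taken from the commands above): the .align output
# paths and the helper name print_align_step are illustrative only.
CWB_BIN=/home/mdecorde/TXMinstall/cwb/bin
REGISTRY=/home/mdecorde/TXM/corpora/tmxtest/registry

print_align_step() {
    # $1 = source corpus, $2 = target corpus, $3 = alignment output file
    # cwb-align aligns the <seg> regions of source and target;
    # cwb-align-encode -D stores the result in the source corpus data directory.
    echo "$CWB_BIN/cwb-align -r $REGISTRY -o $3 $1 $2 seg"
    echo "$CWB_BIN/cwb-align-encode -r $REGISTRY -D $3"
}

# One pass per direction, since each corpus declares the other as ALIGNED.
print_align_step tmxfr tmxen /tmp/tmxfr-tmxen.align
print_align_step tmxen tmxfr /tmp/tmxen-tmxfr.align
```

Whether this step is what the query is missing is an assumption, not something verified against the archive.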
===============================
Registry file (tmxfr):
##
## registry entry for corpus TMXFR
##
# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID tmxfr
# path to binary data files
HOME /home/mdecorde/TXM/corpora/tmxtest/data/fr
# optional info file (displayed by "info;" command in CQP)
INFO /home/mdecorde/TXM/corpora/tmxtest/data/fr/.info
# corpus properties provide additional information about the corpus:
##:: charset = "utf8" # character encoding of corpus data
##:: language = "??" # insert ISO code for language (de, en, fr, ...)
##
## p-attributes (token annotations)
##
ATTRIBUTE word
ATTRIBUTE pos
ATTRIBUTE lemma
ATTRIBUTE id
##
## s-attributes (structural markup)
##
# <text base=".." project=".." id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_base # [annotations]
STRUCTURE text_project # [annotations]
STRUCTURE text_id # [annotations]
# <tu tuid=".." committee=".." vote=".." lead=".." session=".."> ... </tu>
# (no recursive embedding allowed)
STRUCTURE tu
STRUCTURE tu_tuid # [annotations]
STRUCTURE tu_committee # [annotations]
STRUCTURE tu_vote # [annotations]
STRUCTURE tu_lead # [annotations]
STRUCTURE tu_session # [annotations]
# <seg id=".."> ... </seg>
# (no recursive embedding allowed)
STRUCTURE seg
STRUCTURE seg_id # [annotations]
ALIGNED tmxen
# Yours sincerely, the Encode tool.
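One way to check whether an ALIGNED declaration like the one above is backed by data on disk is sketched below. It assumes that alignment data is stored in the corpus data directory (the HOME path) as <target>.alx; that file name is an assumption about the CWB data layout. check_alignment is a hypothetical helper, not a CWB tool:

```shell
#!/bin/sh
# check_alignment REGISTRY_FILE
# For each ALIGNED line in the registry file, report whether a matching
# alignment data file exists in the directory given by HOME.
# Assumption: the data file is named <target>.alx.
check_alignment() {
    reg_file=$1
    # The data directory is the second field of the HOME line.
    home_dir=$(awk '$1 == "HOME" { print $2; exit }' "$reg_file")
    # Each ALIGNED line names one target corpus.
    awk '$1 == "ALIGNED" { print $2 }' "$reg_file" | while read -r target; do
        if [ -f "$home_dir/$target.alx" ]; then
            echo "$target: alignment data found"
        else
            echo "$target: declared in registry, but no $target.alx in $home_dir"
        fi
    done
}
```

It could be called as: check_alignment /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr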