[CWB] Using corpora alignment feature of CWB

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Jul 22 07:12:45 CEST 2010


Hi Matthieu,

Adding ALIGNED to the registry isn't enough. I think you need to run cwb-align and then cwb-align-encode at some point in order to actually create the alignment attribute, or alternatively cwb-align-import. Alas, the former two in particular are largely undocumented; the most extensive material so far is the following:

http://cwb.svn.sourceforge.net/viewvc/cwb/cwb/trunk/man/cwb-align.pod
(aka man cwb-align)
http://cwb.svn.sourceforge.net/viewvc/cwb/cwb/trunk/man/cwb-align-encode.pod
(aka man cwb-align-encode)
http://cwb.svn.sourceforge.net/viewvc/cwb/perl/trunk/CWB/script/cwb-align-import
from line 272 (aka man cwb-align-import)
http://liste.sslmit.unibo.it/pipermail/cwb/2007-February/000064.html
(An old email from Stefan which explains some of the steps in aligning)

I think you'll need something like

cwb-align -S seg tmxfr tmxen seg
or
cwb-align -V seg tmxfr tmxen seg
(the former if corresponding seg elements are identified by their order; the latter if by identical attribute values).

and then something like 

cwb-align-encode out.align

... with extra options (e.g. -r etc.) on some or all of these commands as necessary!
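
Putting the two steps together, a rough sketch might look like the script below. It is untested against your data and rests on several assumptions: the registry path is copied from your commands, the output file names are made up for illustration, and the -r, -o and -D options are used as I understand them from the man pages (registry directory, output file, and "write to the corpus data directory" respectively). By default the script only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Dry-run sketch of the alignment steps; nothing is executed unless RUN=1.
# Assumptions: registry path taken from the commands in this thread,
# illustrative .align file names, options as described in the man pages.
REGISTRY=/home/mdecorde/TXM/corpora/tmxtest/registry

# Print a command, and execute it only when RUN=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Align corpus $1 to corpus $2 on <seg> regions, then encode the result
# as an alignment attribute of the source corpus.
align_pair() {
    src=$1
    tgt=$2
    run cwb-align -r "$REGISTRY" -o "$src-$tgt.align" -S seg "$src" "$tgt" seg
    run cwb-align-encode -r "$REGISTRY" -D "$src-$tgt.align"
}

align_pair tmxfr tmxen   # needed for TMXFR queries with :TMXEN
align_pair tmxen tmxfr   # only needed to query in the other direction
```

Once the alignment attribute has been encoded, the original CQP query should be worth retrying. If the seg elements correspond by id rather than by order, -V would take the place of -S in the cwb-align call.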

Hope this helps -- Stefan may have more to add on this issue.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Matthieu Decorde
Sent: 21 July 2010 14:01
To: cwb at sslmit.unibo.it
Subject: [CWB] Using corpora alignment feature of CWB

Dear all,

I'm trying to use the corpora alignment feature of CWB.

I've built my source files (tmxfr.wtc and tmxen.wtc), and called 
cwb-encode on them.
Then I tuned the registry files by adding 'ALIGNED <theothercorpus>' in 
each registry file
in the declaration part of the <seg> structural attribute.

The following CQP script:
TMXFR;
"séance" :TMXEN "meeting";
shows:
"0 match"
but we know we should get one match.
Am I missing something or doing something wrong?
Thanks for any reply.
Best,

Matthieu

An archive of the files I used is at:
http://mercure.ens-lsh.fr/get?k=4KVeQoloWRT5FaO8hOF

=======
The commands I used

rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/fr/*
rm -rf /home/mdecorde/TXM/corpora/tmxtest/data/en/*

/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
  -d /home/mdecorde/TXM/corpora/tmxtest/data/fr \
  -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxfr.wtc \
  -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxfr \
  -c utf8 -xsB -xsB -P pos -P lemma -P id \
  -S text:0+base+project+id \
  -S tu:0+tuid+committee+vote+lead+session \
  -S seg:0+id

/home/mdecorde/TXMinstall/cwb/bin/cwb-encode \
  -d /home/mdecorde/TXM/corpora/tmxtest/data/en \
  -f /home/mdecorde/TXM/corpora/tmxtest/wtc/tmxen.wtc \
  -R /home/mdecorde/TXM/corpora/tmxtest/registry/tmxen \
  -c utf8 -xsB -xsB -P pos -P lemma -P id \
  -S text:0+base+project+id \
  -S tu:0+tuid+committee+vote+lead+session \
  -S seg:0+id

/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall \
  -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxfr
/home/mdecorde/TXMinstall/cwb/bin/cwb-makeall \
  -r /home/mdecorde/TXM/corpora/tmxtest/registry -V tmxen


===============================

Registry file :

##
## registry entry for corpus TMXFR
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   tmxfr
# path to binary data files
HOME /home/mdecorde/TXM/corpora/tmxtest/data/fr
# optional info file (displayed by "info;" command in CQP)
INFO /home/mdecorde/TXM/corpora/tmxtest/data/fr/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "??"     # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE pos
ATTRIBUTE lemma
ATTRIBUTE id


##
## s-attributes (structural markup)
##

# <text base=".." project=".." id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_base            # [annotations]
STRUCTURE text_project         # [annotations]
STRUCTURE text_id              # [annotations]

# <tu tuid=".." committee=".." vote=".." lead=".." session=".."> ... </tu>
# (no recursive embedding allowed)
STRUCTURE tu
STRUCTURE tu_tuid              # [annotations]
STRUCTURE tu_committee         # [annotations]
STRUCTURE tu_vote              # [annotations]
STRUCTURE tu_lead              # [annotations]
STRUCTURE tu_session           # [annotations]

# <seg id=".."> ... </seg>
# (no recursive embedding allowed)
STRUCTURE seg
STRUCTURE seg_id               # [annotations]

ALIGNED tmxen

# Yours sincerely, the Encode tool.

=============================


