[CWB] Help with CWB under linux
Gassan Tabajah
gtabajah at cs.technion.ac.il
Mon Nov 30 18:44:37 CET 2009
Hi Serge,
My input format looks like this:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers NN2 volunteer
work VVB work
as PRP as
part NN1 part
of PRF of
a AT0 a
team NN1 team
and CJC and
provide VVB provide
help NN1-VVB help
</s>
</text>
</corpus>
I used the following commands under the bin directory:
$ cwb-encode -d /usr/local/mycorpus -f filename.xml \
    -R /usr/local/share/cwb/registry/mycorpus \
    -P pos -P lemma -V text -S s -S corpus
$ cwb-makeall -V MYCORPUS
Then I ran cqp -e and selected MYCORPUS.
When I entered a regular expression like "a.*", I got the following output:
MYCORPUS> "a.*";
2: teer work VVB work <as PRP as> part NN1 part of
5: part of PRF of <a AT0 a> team NN1 team and
7: a team NN1 team <and CJC and> provide VVB provide
But when I tried something simple like "a", I got no matches:
MYCORPUS> "a";
0 matches.
I don't understand why I got these results. Do you have any ideas?
What should the output of 'cwb-decode' look like? Do you have an example of
how to use it?
(BTW, I am using cwb-2.2.b99-RC1 version under Cygwin).
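One possible reading of the output above, offered as an assumption and not a confirmed diagnosis: the matched tokens are shown as whole lines (for example <a AT0 a>), which is what would happen if the columns in the input file were separated by spaces, since cwb-encode expects one token per line with TAB-separated columns. Under that assumption, "a" finds nothing because the word attribute holds "a AT0 a", while "a.*" matches it. The Python sketch below rewrites whitespace-separated token lines with TABs and passes XML tag lines through unchanged. The function name to_vertical is hypothetical and not part of CWB.

```python
def to_vertical(lines):
    """Yield input lines with token columns joined by TABs.

    Assumption: every line that does not start with '<' is a token line
    whose columns (word, pos, lemma) are separated by whitespace.
    XML tag lines are passed through; blank lines are dropped.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        if stripped.startswith("<"):
            yield stripped  # structural tag such as <s> or </text>
        else:
            yield "\t".join(stripped.split())  # word<TAB>pos<TAB>lemma


if __name__ == "__main__":
    sample = ["<s>", "a AT0 a", "team NN1 team", "</s>"]
    for row in to_vertical(sample):
        print(repr(row))
```

The converted file could then be passed to cwb-encode with the same options as before. Note that a word containing internal spaces would be split into extra columns by this sketch.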
Regards,
Ghassan Tabajah
SoftWare Engineer - Mila Center
Computer Science Faculty -Technion
Room 644, Tel: (829) 3969
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Serge HEIDEN
Sent: Monday, November 30, 2009 6:58 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Help with CWB under linux
Dear Ghassan,
From: "Gassan Tabajah" <gtabajah at cs.technion.ac.il>
>> Also, I noticed that the following files under the "mycorpus" directory,
>> lemma.corpus, pos.corpus and word.corpus, contain only <nul>s (is that
>> an error?)
Yes, this is an error.
Try to use the 'cwb-decode' tool to decode your indexes independently
of using them from 'cqp'.
It seems that your 'cwb-encode' or 'cwb-makeall' process had a problem.
Are you sure of your input format? Do you have an excerpt of it?
Best,
Serge
--
Dr. Serge Heiden, slh at ens-lsh.fr, http://textometrie.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb