[CWB] Help with CWB under linux

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Nov 30 19:36:26 CET 2009


Gassan,
 
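It looks suspiciously as if the entire line is being encoded as a single p-attribute rather than three separate p-attributes, because of a problem in the input format: you appear to be using spaces to delimit the columns. The different "fields" on each line need to be delimited by a single tab in the input file, with no spaces. CWB counts spaces as "part of the word". That would also explain your query results: CQP matches a regular expression against the whole attribute value, so "a.*" matches a token that is really the full line "a       AT0     a", while "a" on its own matches nothing.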
 
In other words, you need
 
volunteersTABNN2TABvolunteer
 
or, in regex-style, volunteers\tNN2\tvolunteer
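
If the file already exists with space-separated columns, it can be converted rather than retyped. The following is only a minimal Python sketch under some assumptions: no field contains internal whitespace, every line starting with "<" is an XML tag to be passed through unchanged, and the function name to_vertical_format is made up for illustration (it is not part of CWB).

```python
def to_vertical_format(lines):
    """Yield each input line with its whitespace-separated columns joined by single tabs.

    Assumption: lines starting with "<" are XML tags (e.g. <s>, <text ...>)
    and are passed through unchanged apart from trimming surrounding whitespace.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("<"):
            # Structural tag or blank line: leave the content alone.
            yield stripped
        else:
            # Token line: collapse any run of spaces/tabs into one tab.
            yield "\t".join(stripped.split())


if __name__ == "__main__":
    sample = [
        "<s>",
        "volunteers      NN2     volunteer",
        "help    NN1-VVB help",
        "</s>",
    ]
    for out in to_vertical_format(sample):
        print(out)
```

The output can be written to a new file and passed to cwb-encode in place of the original. A token that itself begins with "<" would be mistaken for a tag by this sketch, so check the result before encoding.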
 
hope that helps!
 
best
 
Andrew.

________________________________

From: cwb-bounces at sslmit.unibo.it on behalf of Gassan Tabajah
Sent: Mon 30/11/2009 17:44
To: 'Open source development of the Corpus WorkBench'
Cc: 'Itai Alon'
Subject: RE: [CWB] Help with CWB under linux



Hi Serge,

My input format looks like this:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers      NN2     volunteer
work    VVB     work
as      PRP     as
part    NN1     part
of      PRF     of
a       AT0     a
team    NN1     team
and     CJC     and
provide VVB     provide
help    NN1-VVB help
</s>
</text>
</corpus>

I used the following commands under the bin directory:
$ cwb-encode -d /usr/local/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -P pos -P lemma -V text -S s -S corpus
$ cwb-makeall -V MYCORPUS

Then I ran cqp -e -> MYCORPUS
When I entered a regular expression like "a.*", I got the following output:
MYCORPUS> "a.*";
        2: teer work    VVB     work <as      PRP     as> part    NN1   part of
        5:   part of      PRF     of <a       AT0     a> team    NN1    team and
        7:    a team    NN1     team <and     CJC     and> provide VVB provide

But when I tried something simple like "a", I got no matches:
MYCORPUS> "a";
0 matches.

I don't exactly understand why I got these results. Do you have any ideas?
What should the output of 'cwb-decode' be? Do you have an example of how to
use it?
(BTW, I am using version cwb-2.2.b99-RC1 under Cygwin.)


Regards,
Ghassan Tabajah
SoftWare Engineer  - Mila Center
Computer Science  Faculty -Technion
Room 644, Tel: (829) 3969

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Serge HEIDEN
Sent: Monday, November 30, 2009 6:58 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Help with CWB under linux

Dear Ghassan,

From: "Gassan Tabajah" <gtabajah at cs.technion.ac.il>
>> Also, I noticed that the following files under the "mycorpus" directory:
>> lemma.corpus, pos.corpus, word.corpus contain only <nul>'s (is that
>> an error?)

Yes, this is an error.
Try using the 'cwb-decode' tool to decode your indexes independently
of using them from 'cqp'.
It seems that your 'cwb-encode' or 'cwb-makeall' process had a problem.
Are you sure about your input format? Do you have an excerpt of it?

Best,
Serge

--
Dr. Serge Heiden, slh at ens-lsh.fr, http://textometrie.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb




