[CWB] Help with CWB under linux

Serge Heiden slh at ens-lsh.fr
Mon Nov 30 21:53:54 CET 2009


Ghassan,

1) In the example corpus you sent us by mail, we
can only see spaces between the properties on
each word line, as if you had called 'untabify' on the
selection in Emacs. I don't know which text editor you
use, or whether your mail client replaced the tabs with spaces.
But if the TABs really are gone from your file, Andrew
has given a good diagnosis of your problem.

2) Also, the '.xml' extension of the input file you
pass to cwb-encode makes me think you edit it with
an XML editor, which may not be a good idea:
1- whitespace has a special meaning in XML, so
your editor may change tab and space characters
in unexpected ways when saving the corpus;
2- you MUST have TABs between the properties on each
word line for CWB to work properly.
One suggestion: rename your corpus 'filename.txt'
and edit it with a plain TEXT editor, so that tabs are kept when saving.
Then retry.
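
If the tabs are already lost, the file could also be repaired with a
small script rather than by hand. The following is only a sketch in
Python, not part of CWB: the function names retab_line and retab_text
are made up for illustration, and it assumes that no column value
contains a space itself and that every line starting with '<' is a
structure tag to be left untouched.

```python
def retab_line(line: str) -> str:
    """Rewrite one line of a vertical-format corpus file.

    Runs of whitespace between columns become single TABs.
    """
    stripped = line.rstrip("\n")
    # Assumption: lines starting with '<' are XML structure tags
    # (<corpus>, <text ...>, <s>, ...); they and blank lines are kept as-is.
    if not stripped.strip() or stripped.lstrip().startswith("<"):
        return stripped
    # Assumption: no column value contains whitespace of its own.
    return "\t".join(stripped.split())


def retab_text(text: str) -> str:
    """Apply retab_line to every line and return the repaired text."""
    return "\n".join(retab_line(l) for l in text.splitlines()) + "\n"


if __name__ == "__main__":
    sample = (
        "<s>\n"
        "volunteers      NN2     volunteer\n"
        "work    VVB     work\n"
        "</s>\n"
    )
    # repr() makes the inserted TABs visible as \t
    print(repr(retab_text(sample)))
```

To use it on a real file, read the file's content, pass it through
retab_text and write the result to a new file, keeping the original
until the re-encoded corpus has been checked.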

3) To answer your other question, cwb-decode can
show something like:
$ cwb-decode -r . MYCORPUS -P word -P lemma
word=volunteers lemma=volunteer
word=work       lemma=work
...

In your case, it probably shows:
$ cwb-decode -r . MYCORPUS -P word -P lemma
word=volunteers    NN2    volunteer     lemma=__UNDEF__
word=work    VVB    work        lemma=__UNDEF__
...

This is not what you expected, but it is correct from CWB's
point of view: the whole space-separated line was stored as the
word, and the lemma was left undefined.

4) It seems that you use either the Linux or the Cygwin version
of CWB. Instead of the Cygwin port, may I suggest the MinGW
version we have built for another project
(http://sourceforge.net/projects/textometrie/).
You will find all the necessary sources and binaries at:
http://textometrie.svn.sourceforge.net/viewvc/textometrie/trunk/toolbox/src/main/C/cwb-3.0/
For the binaries to work straight out of SVN, you will need 'libgnurx-0.dll' and
all the '.exe' files.
This is not a fork of CWB: we will contribute the source to the CWB project
when we have time.
Of course, if you find bugs in our Windows port, we will be very
pleased to hear about them.

Best,
Serge


----- Original Message ----- 
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: "Open source development of the Corpus WorkBench" <cwb at sslmit.unibo.it>
Cc: "Itai Alon" <itai at cs.technion.ac.il>
Sent: Monday, November 30, 2009 7:36 PM
Subject: RE: [CWB] Help with CWB under linux


Gassan,

It looks suspiciously as if the entire line is being encoded as a single 
p-attribute rather than 3 different p-attributes, due to a problem in the 
input format: it looks as if you are using spaces to delimit the columns. The 
different "fields" on each line need to be delimited by a single tab in the 
input file, with no spaces. CWB counts spaces as "part of the word".

In other words, you need

volunteersTABNN2TABvolunteer

or, in regex-style, volunteers\tNN2\tvolunteer
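
This also explains the two query results: a CQP regular expression has
to match the whole token, and here the "token" is the entire
space-separated line. The sketch below imitates that behaviour with
Python's re.fullmatch. It is only an illustration: CQP does not use
Python's regex engine, and the helper name tokens_matching is made up
for this example.

```python
import re


def tokens_matching(pattern, tokens):
    """Return the tokens that the pattern matches in full (not as a substring)."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


# What ends up in the 'word' attribute when columns are separated by spaces:
space_encoded = [
    "work    VVB     work",
    "as      PRP     as",
    "a       AT0     a",
]

# What the 'word' attribute would hold after a tab-delimited encode:
tab_encoded = ["work", "as", "a"]

if __name__ == "__main__":
    # "a.*" matches every whole line that starts with 'a'
    print(tokens_matching("a.*", space_encoded))
    # "a" matches nothing: no token is exactly 'a'
    print(tokens_matching("a", space_encoded))
    # with tabs in the input, "a" finds the token 'a'
    print(tokens_matching("a", tab_encoded))
```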

hope that helps!

best

Andrew.

________________________________

From: cwb-bounces at sslmit.unibo.it on behalf of Gassan Tabajah
Sent: Mon 30/11/2009 17:44
To: 'Open source development of the Corpus WorkBench'
Cc: 'Itai Alon'
Subject: RE: [CWB] Help with CWB under linux



Hi Serge,

My input format looks like this:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers      NN2     volunteer
work    VVB     work
as      PRP     as
part    NN1     part
of      PRF     of
a       AT0     a
team    NN1     team
and     CJC     and
provide VVB     provide
help    NN1-VVB help
</s>
</text>
</corpus>

I used the following commands under the bin directory:
$ cwb-encode -d /usr/local/mycorpus -f filename.xml \
    -R /usr/local/share/cwb/registry/mycorpus \
    -P pos -P lemma -V text -S s -S corpus
$ cwb-makeall -V MYCORPUS

Then I ran cqp -e -> MYCORPUS
When I entered a regular expression like "a.*", I got the following output:
MYCORPUS> "a.*";
        2: teer work    VVB     work <as      PRP     as> part    NN1   part of
        5:   part of      PRF     of <a       AT0     a> team    NN1    team and
        7:    a team    NN1     team <and     CJC     and> provide VVB provide

But when I tried something simple like "a", I got no matches:
MYCORPUS> "a";
0 matches.

I don't exactly understand why I got these results. Do you have any ideas?
What should the output of 'cwb-decode' be? Do you have an example of how to
use it?
(BTW, I am using version cwb-2.2.b99-RC1 under Cygwin.)


Regards,
Ghassan Tabajah
SoftWare Engineer  - Mila Center
Computer Science  Faculty -Technion
Room 644, Tel: (829) 3969

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Serge HEIDEN
Sent: Monday, November 30, 2009 6:58 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Help with CWB under linux

Dear Ghassan,

From: "Gassan Tabajah" <gtabajah at cs.technion.ac.il>
>> Also I noticed that the following files under "mycorpus" directory:
>> lemma.corpus, pos.corpus, word.corpus includes only <nul>'s (Is that
>> an error !?)

Yes, this is an error.
Try using the 'cwb-decode' tool to decode your indexes independently
of using them from 'cqp'.
It seems that your 'cwb-encode' or 'cwb-makeall' process had a problem.
Are you sure of your input format? Do you have an excerpt of it?

Best,
Serge

--
Dr. Serge Heiden, slh at ens-lsh.fr, http://textometrie.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb
