[CWB] Help with CWB under linux
Serge Heiden
slh at ens-lsh.fr
Mon Nov 30 21:53:54 CET 2009
Ghassan,
1) In the example corpus you sent us by mail, we
can only see spaces between the properties on
each word line, as if you had called 'untabify' on the
selection in Emacs. I don't know which text editor you
use, or whether your mail client replaced the TABs with spaces.
But if the TABs really are gone from your file, Andrew's
diagnosis of your problem is correct.
2) Also, the '.xml' extension of the input file you pass
to cwb-encode makes me think you are using an XML editor,
which may not be a good idea:
1- because whitespace has a very special meaning in XML,
and your editor could do strange things with
tab and space characters when saving the corpus
2- you MUST have TABs between properties on each
word line for CWB to work properly
One suggestion: rename your corpus to 'filename.txt'
and edit it with a plain TEXT editor so that the tabs are kept when you save.
Then retry.
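If editing by hand is impractical, the spaces could also be turned back into TABs with a small script. The following is only a minimal sketch in Python: the function name `retab_vertical_text` and the default of three columns (word, pos, lemma) are assumptions made for illustration, and it is only safe when no token itself contains a space.

```python
def retab_vertical_text(text, n_columns=3):
    """Rewrite token lines so that columns are separated by single TABs.

    Assumption: columns are separated by runs of whitespace and no
    column value contains whitespace itself.
    """
    fixed = []
    for line in text.splitlines():
        stripped = line.strip()
        # Leave empty lines and XML tag lines (<s>, </text>, ...) alone.
        if not stripped or stripped.startswith("<"):
            fixed.append(stripped)
            continue
        # Split on runs of whitespace into at most n_columns fields.
        fields = stripped.split(None, n_columns - 1)
        fixed.append("\t".join(fields))
    return "\n".join(fixed) + "\n"


sample = "<s>\nvolunteers NN2 volunteer\nwork   VVB work\n</s>\n"
print(repr(retab_vertical_text(sample)))
```

The result can then be written to 'filename.txt' and passed to cwb-encode as before.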
3) To answer your other question, cwb-decode can
produce output like this:
$ cwb-decode -r . MYCORPUS -P word -P lemma
word=volunteers lemma=volunteer
word=work lemma=work
...
In your case, it will probably print:
$ cwb-decode -r . MYCORPUS -P word -P lemma
word=volunteers NN2 volunteer lemma=__UNDEF__
word=work VVB work lemma=__UNDEF__
...
This is not what you expected, but it is correct from CWB's
point of view.
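The same reasoning would explain your CQP results: a CQP regular expression has to match the whole value of the attribute, and here the value of 'word' is the entire line. As a rough analogue only, the sketch below uses Python's `re.fullmatch` as a stand-in for CQP's matching; the helper name `cqp_like_match` is hypothetical, and the list holds your token lines as they appear to have been encoded.

```python
import re

# Values of the 'word' attribute as they seem to have been encoded:
# each whole input line ended up as a single value.
encoded_words = [
    "volunteers NN2 volunteer",
    "work VVB work",
    "as PRP as",
    "part NN1 part",
    "of PRF of",
    "a AT0 a",
    "team NN1 team",
    "and CJC and",
]


def cqp_like_match(pattern, values):
    """Return (position, value) pairs whose whole value matches pattern."""
    return [(i, v) for i, v in enumerate(values) if re.fullmatch(pattern, v)]


print(cqp_like_match("a.*", encoded_words))
print(cqp_like_match("a", encoded_words))
```

With "a.*" the values at positions 2, 5 and 7 match, which are the positions CQP reported; with "a" nothing matches, because no value is exactly "a".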
4) It seems that you are using either the Linux or the Cygwin version
of CWB. Instead of the Cygwin port, I suggest you try
the MinGW version we have built for another project
(http://sourceforge.net/projects/textometrie/).
You will find all the necessary sources and binaries at:
http://textometrie.svn.sourceforge.net/viewvc/textometrie/trunk/toolbox/src/main/C/cwb-3.0/
For the binaries to work straight out of SVN, you will need 'libgnurx-0.dll' and
all the '.exe' files.
This is not a fork of CWB: we will contribute the source to the CWB project
when we have time.
Of course, if you find bugs in our Windows port, we will be very
pleased to hear about them.
Best,
Serge
----- Original Message -----
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: "Open source development of the Corpus WorkBench" <cwb at sslmit.unibo.it>
Cc: "Itai Alon" <itai at cs.technion.ac.il>
Sent: Monday, November 30, 2009 7:36 PM
Subject: RE: [CWB] Help with CWB under linux
Gassan,
It looks suspiciously as if the entire line is being encoded as a single
p-attribute rather than 3 different p-attributes, due to a problem in the
input format: it looks as if you are using spaces to delimit the columns. The
different "fields" on each line need to be delimited by a single tab in the
input file, with no spaces. CWB counts spaces as "part of the word".
In other words, you need
volunteersTABNN2TABvolunteer
or, in regex-style, volunteers\tNN2\tvolunteer
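A quick way to check a file before encoding is to count the tab-separated fields on every token line. This is just a minimal sketch in Python; the name `check_token_lines` and the default of three fields (word, pos, lemma) are assumptions for illustration, not part of CWB.

```python
def check_token_lines(lines, expected_fields=3):
    """Return (line number, field count, line) for every suspicious token line."""
    problems = []
    for number, line in enumerate(lines, start=1):
        content = line.rstrip("\n")
        # Skip empty lines and XML tag lines such as <s> or </text>.
        if not content.strip() or content.lstrip().startswith("<"):
            continue
        # Only a TAB separates fields; spaces stay part of the value.
        n_fields = len(content.split("\t"))
        if n_fields != expected_fields:
            problems.append((number, n_fields, content))
    return problems


good = ["<s>", "volunteers\tNN2\tvolunteer", "</s>"]
bad = ["<s>", "volunteers NN2 volunteer", "</s>"]
print(check_token_lines(good))
print(check_token_lines(bad))
```

A space-delimited line is reported as having a single field, which is how the encoder would see it.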
hope that helps!
best
Andrew.
________________________________
From: cwb-bounces at sslmit.unibo.it on behalf of Gassan Tabajah
Sent: Mon 30/11/2009 17:44
To: 'Open source development of the Corpus WorkBench'
Cc: 'Itai Alon'
Subject: RE: [CWB] Help with CWB under linux
Hi Serge,
My input format looks like this:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers NN2 volunteer
work VVB work
as PRP as
part NN1 part
of PRF of
a AT0 a
team NN1 team
and CJC and
provide VVB provide
help NN1-VVB help
</s>
</text>
</corpus>
I used the following commands under the bin directory:
$ cwb-encode -d /usr/local/mycorpus -f filename.xml \
    -R /usr/local/share/cwb/registry/mycorpus \
    -P pos -P lemma -V text -S s -S corpus
$ cwb-makeall -V MYCORPUS
Then I ran cqp -e and activated MYCORPUS.
When I entered a regular expression like "a.*", I got the following output:
MYCORPUS> "a.*";
2: teer work VVB work <as PRP as> part NN1 part of
5: part of PRF of <a AT0 a> team NN1 team and
7: a team NN1 team <and CJC and> provide VVB provide
But when I tried something simple like "a", I got no matches:
MYCORPUS> "a";
0 matches.
I don't exactly understand why I got these results. Do you have any ideas?
What should the output of 'cwb-decode' be? Do you have an example of how to use
it?
(BTW, I am using cwb-2.2.b99-RC1 version under Cygwin).
Regards,
Ghassan Tabajah
SoftWare Engineer - Mila Center
Computer Science Faculty -Technion
Room 644, Tel: (829) 3969
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Serge HEIDEN
Sent: Monday, November 30, 2009 6:58 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Help with CWB under linux
Dear Ghassan,
From: "Gassan Tabajah" <gtabajah at cs.technion.ac.il>
>> Also, I noticed that the following files under the "mycorpus" directory:
>> lemma.corpus, pos.corpus, word.corpus contain only <nul>s (is that
>> an error!?)
Yes, this is an error.
Try using the 'cwb-decode' tool to decode your indexes independently
of 'cqp'.
It seems that your 'cwb-encode' or 'cwb-makeall' process had a problem.
Are you sure about your input format? Do you have an excerpt of it?
Best,
Serge
--
Dr. Serge Heiden, slh at ens-lsh.fr, http://textometrie.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb