[CWB] Sample corpus for IMS Corpus Workbench
Ray Wu
liangpingwu at 126.com
Fri May 25 07:09:26 CEST 2012
I’m also interested in this topic, as I can see that a pre-indexing approach is the only(?) way to put a very big corpus online.
I did the patch work, and the new DICKENS corpus can be queried from the terminal via cqp. Good work, Andrew. Next, I wanted to load it into CQPweb via the browser.
Here is the layout of the new DICKENS corpus on my computer:
/Dickens/DICKENS-cqbweb-edition$ ls
data registry
/Dickens/DICKENS-cqbweb-edition/registry$ ls
dickens
Two lines are changed in the registry file "dickens" to make it CQPweb compatible:
# data file directory (relative or absolute path)
HOME /home/ray/Dickens/DICKENS-cqbweb-edition/data
# optional info file (displayed by "info" command in CQP)
INFO /home/ray/Dickens/DICKENS-cqbweb-edition/data/.info
My metadata for DICKENS:
/usr/local/apache2/cqpweb_aux/upload$ cat dickens_meta.txt
ACC ACC
DC DC
DaSDas
GE GE
HT HT
MHC MHC
NN NN
OT OT
OMF OMF
BOZBOZ
ToTC ToTC
OCS OCS
PP PP
3GS3GS
When asked “Where is the registry file?” I specified “In the directory specified here:”
/home/ray/Dickens/DICKENS-cqbweb-edition/registry
After hitting "Install corpus with settings above", I got the following error message:
CQPweb encountered an error and could not continue.
The data directory specified in the registry file could not be found.
... in file /usr/local/apache2/htdocs/cqp/lib/admin-install.inc.php line 146.
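The error above concerns the HOME line of the registry file. One way to narrow it down is to check from the shell whether that path exists and is readable. The following is only a minimal sketch: the helper names registry_home and check_registry_home are made up for illustration, and it assumes the HOME path contains no spaces.

```shell
# Print the data directory named on the HOME line of a CWB registry file.
# Assumes the path has no embedded whitespace.
registry_home() {
    awk '$1 == "HOME" { print $2; exit }' "$1"
}

# Report whether that directory exists and is readable by the current user.
check_registry_home() {
    dir=$(registry_home "$1")
    if [ -d "$dir" ] && [ -r "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing or unreadable: $dir"
        return 1
    fi
}
```

For the layout shown earlier, the call would be check_registry_home /home/ray/Dickens/DICKENS-cqbweb-edition/registry/dickens. Since CQPweb's PHP code runs under the web server's account, the result is probably more telling when the check is run as that user rather than as the owner of the home directory.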
I looked in /usr/local/apache2/htdocs/cqp (my CQPweb directory) and found that no directory called dickens had been created. However, if I commented out line 146, the dickens directory was created in the CQPweb program directory, but still no index was created in /usr/local/apache2/cqpweb_aux/index (my CQPweb index directory for all corpora).
After I manually moved all of the DICKENS corpus's index files to CQPweb's index directory, I could start using the DICKENS corpus via my browser.
Everything worked except the “Restricted Query”. When I tried to search for the word "the" in ACC, the browser said:
Your query had no results.
There are no matches for your query.
It seems that my metadata is not recognized. I guess this might have to do with some internal changes to the DICKENS corpus not implemented by the patch work yet. Am I correct?
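One thing that can be ruled out from the shell is a mismatch between the metadata file and the text_id values in the corpus. The sketch below makes two assumptions: that the metadata file is meant to be tab-delimited with the text ID in the first column, and that the IDs in the corpus have been written to a plain file, one per line. The function names are hypothetical.

```shell
# List metadata lines that have fewer than two tab-separated fields.
meta_bad_lines() {
    awk -F '\t' 'NF < 2 { print NR ": " $0 }' "$1"
}

# List IDs from column 1 of the metadata file that are not in the ID list.
# $1 = metadata file, $2 = file with one text_id value per line
meta_unknown_ids() {
    awk -F '\t' '
        FILENAME == ARGV[1] { known[$0] = 1; next }   # first file: known IDs
        !($1 in known)      { print $1 }              # second file: metadata
    ' "$2" "$1"
}
```

The ID list could probably be produced with something like cwb-s-decode -r registry DICKENS -S text_id | cut -f3 > ids.txt, assuming the decoder prints start, end and value separated by tabs. If both functions print nothing, the metadata file and the corpus at least agree on the text IDs, and the cause lies elsewhere.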
Best,
Ray
At 2012-05-25 07:01:49,"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:
Well, I was referring to editing the source, not realising that Stefan did not have it to hand. BUT you can still encode the two necessary s-attributes as extras, using the attached files.
Assuming you are in the root of the tutorial corpus, insert those two files there, and run these commands:
cwb-s-encode -d data -f text.src -S text
cwb-s-encode -d data -f text_id.src -V text_id
Then add the following lines to the registry/dickens file:
# <text id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id # [annotations]
(in with the other s-atts).
IF all the above works successfully, the corpus should become CQPweb-compatible. You can check whether it worked as follows:
cwb-describe-corpus -r registry -sd DICKENS | less
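Before running the two cwb-s-encode commands above, it may be worth checking that the .src files have the shape the tool reads. The sketch below assumes that shape is one region per line, written as start and end corpus positions separated by a tab, with a third tab-separated value column for -V attributes, and regions in ascending order without overlap. The function name check_regions is made up for illustration.

```shell
# Check a region file: "start<TAB>end" per line, optionally "<TAB>value".
# $1 = region file, $2 = "values" if a third column is required
check_regions() {
    awk -F '\t' -v need="${2:-}" '
        $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {
            print "line " NR ": bad positions"; bad = 1; next
        }
        $1 + 0 > $2 + 0            { print "line " NR ": start after end"; bad = 1 }
        seen && $1 + 0 <= prev     { print "line " NR ": overlaps previous region"; bad = 1 }
        need == "values" && NF < 3 { print "line " NR ": missing value"; bad = 1 }
        { prev = $2 + 0; seen = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Typical use would be check_regions text.src for the -S attribute and check_regions text_id.src values for the -V attribute. No output and a zero exit status mean no problems were found under these assumptions.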
best
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Kurt Sultana
Sent: 24 May 2012 15:27
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Sample corpus for IMS Corpus Workbench
Thanks for your input guys.
Andrew, when you said:
adjusting the existing tutorial data to make it CQPweb-compatible is much easier, as outlined
I didn't quite get you there. The files in the /data directory seem to be encoded (I believe CWB encodes them in the process). Where should I do the changes from <novel> to <text>?
Thanks,
Kurt
On Tue, May 22, 2012 at 10:50 AM, Stefan Evert <stefanML at collocations.de> wrote:
> And if you can hang on till I and/or Stefan finds a suitable schedule hole (which alas can take a very long time as neither of us works on CWB as our main job), we’ll do it for you, as Stefan said!
I'm afraid this may have to wait until my laptop stops being dead -- apparently the motherboard is broken -- and I can get my hands on the source code of the demo corpora again. I might want to put them in a safer place then ...
I'll set a reminder to look at the issue again in early June.
Cheers,
Stefan