[CWB] Sample corpus for IMS Corpus Workbench

Ray Wu liangpingwu at 126.com
Fri May 25 07:09:26 CEST 2012


I’m also interested in this topic, as I can see that a pre-indexing approach is the only(?) way to put a very big corpus online.

 

I did the patch work and the new DICKENS corpus can be queried from the terminal via cqp. Good work, Andrew. Now I want to load it into CQPweb via the browser.

 

Here is the layout of the new DICKENS corpus on my computer:

/Dickens/DICKENS-cqbweb-edition$ ls

data  registry

 

/Dickens/DICKENS-cqbweb-edition/registry$ ls

dickens

 

Two lines are changed in the registry file "dickens" to make it CQPweb compatible:

# data file directory (relative or absolute path)

HOME /home/ray/Dickens/DICKENS-cqbweb-edition/data

# optional info file (displayed by "info" command in CQP)

INFO /home/ray/Dickens/DICKENS-cqbweb-edition/data/.info

 

My metadata for DICKENS:

/usr/local/apache2/cqpweb_aux/upload$ cat dickens_meta.txt

ACC	ACC
DC	DC
DaS	Das
GE	GE
HT	HT
MHC	MHC
NN	NN
OT	OT
OMF	OMF
BOZ	BOZ
ToTC	ToTC
OCS	OCS
PP	PP
3GS	3GS
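As far as I understand, CQPweb reads the metadata file as tab-delimited text with the text ID in the first column, so it may be worth confirming that every line really contains a tab. A minimal sketch, where check_meta_tabs is a hypothetical helper name and not part of CQPweb:

```shell
# Print every non-empty line of a metadata file that has fewer than two
# tab-separated fields; the exit status is 1 if any such line is found.
# check_meta_tabs is a hypothetical helper name, not a CQPweb command.
check_meta_tabs() {
    awk -F '\t' '
        NF > 0 && NF < 2 { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling it as check_meta_tabs dickens_meta.txt prints nothing when every line has at least an ID and one further column.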

 

When asked “Where is the registry file?” I specified “In the directory specified here:”

/home/ray/Dickens/DICKENS-cqbweb-edition/registry

 

After hitting "Install corpus with settings above", I got the following error message:

CQPweb encountered an error and could not continue.

The data directory specified in the registry file could not be found.

... in file /usr/local/apache2/htdocs/cqp/lib/admin-install.inc.php line 146.
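One way to narrow this down is to pull the HOME path out of the registry file and test whether it exists and is readable. The sketch below is only an assumption about where to look: registry_home and check_registry_home are hypothetical helper names, and it assumes an unquoted path on the HOME line. Whether a permissions problem for the web server account is the actual cause is a guess.

```shell
# Print the data directory named on the HOME line of a CWB registry file.
# registry_home is a hypothetical helper name, not part of CWB or CQPweb.
registry_home() {
    awk '$1 == "HOME" { print $2; exit }' "$1"
}

# Report whether that directory exists and is readable by the current user.
check_registry_home() {
    dir=$(registry_home "$1")
    if [ -n "$dir" ] && [ -d "$dir" ] && [ -r "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing or unreadable: $dir"
        return 1
    fi
}
```

For example, check_registry_home /home/ray/Dickens/DICKENS-cqbweb-edition/registry/dickens, ideally run under the account that Apache runs as, since a directory under /home may be readable by me but not by the web server.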

 

I looked in /usr/local/apache2/htdocs/cqp (my CQPweb directory) and found that no directory called dickens had been created. However, if I commented out line 146, the dickens directory was created in the CQPweb program directory, but still no index was created in /usr/local/apache2/cqpweb_aux/index (my CQPweb index directory for all corpora).

 

After I manually moved all the DICKENS index files to CQPweb's index directory, I could start to use the DICKENS corpus via my browser.

 

Everything worked except the “Restricted Query”. When I tried to search for the word "the" in ACC, the browser said:

Your query had no results.

There are no matches for your query.

 

It seems that my metadata is not recognized. I guess this might have to do with some internal changes to the DICKENS corpus that the patch does not implement yet. Am I correct?
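One way to test this guess would be to compare the IDs in the metadata file with the text_id values actually stored in the corpus. The sketch below assumes the corpus IDs have been dumped to a file with lines of the form start<TAB>end<TAB>value, which I believe is what cwb-s-decode prints by default. ids_missing_from_corpus is a hypothetical helper name.

```shell
# List IDs found in the metadata file (column 1) that do not occur among
# the corpus text_id values (column 3 of the dump file).
# ids_missing_from_corpus is a hypothetical helper name, not a CWB tool.
ids_missing_from_corpus() {
    meta_ids=$(mktemp)
    corpus_ids=$(mktemp)
    # First column of the metadata file, blank lines dropped.
    cut -f 1 "$1" | sed '/^$/d' | sort -u > "$meta_ids"
    # Third column, assuming "start<TAB>end<TAB>value" lines.
    cut -f 3 "$2" | sort -u > "$corpus_ids"
    # Lines unique to the first list, i.e. IDs the corpus does not know.
    comm -23 "$meta_ids" "$corpus_ids"
    rm -f "$meta_ids" "$corpus_ids"
}
```

The dump file could be produced with something like cwb-s-decode -r registry DICKENS -S text_id > text_id.tsv (file name made up here), followed by ids_missing_from_corpus dickens_meta.txt text_id.tsv. Any ID it prints is one that a restricted query could not match.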

 

Best,

Ray


At 2012-05-25 07:01:49,"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:


Well, I was referring to editing the source, not realising that Stefan did not have it to hand. BUT you can still encode the two necessary s-attributes as extras, using the attached files.

 

Assuming you are in the root of the tutorial corpus, insert those two files there, and run these commands:

 

cwb-s-encode -d data -f text.src -S text

cwb-s-encode -d data -f text_id.src -V text_id

 

Then add the following lines to the registry/dickens file:

 

# <text id=".."> ... </text>

# (no recursive embedding allowed)

STRUCTURE text

STRUCTURE text_id              # [annotations]

 

(in with the other s-atts).

 

IF all the above works successfully, the corpus should become CQPweb-compatible. You can check whether it worked as follows:

 

cwb-describe-corpus -r registry -sd DICKENS | less

 

best

 

Andrew.

 

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Kurt Sultana
Sent: 24 May 2012 15:27
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Sample corpus for IMS Corpus Workbench

 

Thanks for your input guys. 

 

Andrew, when you said:

 

adjusting the existing tutorial data to make it CQPweb-compatible is much easier, as outlined

 

I didn't quite get you there. The files in the /data directory seem to be encoded (I believe CWB encodes them in the process). Where should I make the changes from <novel> to <text>?

 

Thanks,

Kurt

 

On Tue, May 22, 2012 at 10:50 AM, Stefan Evert <stefanML at collocations.de> wrote:


> And if you can hang on till I and/or Stefan finds a suitable schedule hole (which alas can take a very long time as neither of us works on CWB as our main job), we’ll do it for you, as Stefan said!

I'm afraid this may have to wait until my laptop stops being dead -- apparently the motherboard is broken -- and I can get my hands on the source code of the demo corpora again.  I might want to put them in a safer place then ...

I'll set a reminder to look at the issue again in early June.

Cheers,
Stefan



 