[CWB] WebInABox: Can't import existing corpora from host

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Jul 24 16:19:19 CEST 2016


CQPweb requires all corpora to have at least one <text> element, and every <text> element has to have an id. In other words, everything within the corpus has to be contained within a sequence of one or more of the following:

<text id="somethinghere">
…
</text>

If you’ve indexed a corpus in CWB that doesn’t have these, you need to add them (it’s fine to have just one <text>…</text> for the whole corpus if you don’t have any meaningful text divisions) before importing into CQPweb.

You can do this with cwb-s-encode.
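
As a rough sketch of what that could look like for a corpus with no meaningful text divisions: cwb-s-encode reads regions as tab-separated lines (start position, end position and, when values are stored, an annotation), so one region from token 0 to the last token covers the whole corpus. The helper name whole_corpus_region, the id value "whole" and the trick of deriving the token count from the size of word.corpus (assumed to hold one 4-byte integer per token) are illustrative assumptions, and the options in the comments should be checked against cwb-s-encode -h on your installation.

```shell
#!/bin/sh
# Hypothetical helper: print one region line covering the whole corpus.
# $1 = CWB data directory, $2 = id value to give the single <text>.
whole_corpus_region() {
    datadir=$1
    text_id=$2
    # Assumption: word.corpus stores one 4-byte integer per token,
    # so the token count is the file size divided by 4.
    bytes=$(wc -c < "$datadir/word.corpus" | tr -d ' ')
    tokens=$((bytes / 4))
    if [ "$tokens" -lt 1 ]; then
        echo "no tokens found in $datadir/word.corpus" >&2
        return 1
    fi
    # Corpus positions are zero-based, so the last token is tokens - 1.
    printf '0\t%d\t%s\n' "$((tokens - 1))" "$text_id"
}

# Intended use (not run here; adjust the path and id to your corpus):
#   dir=/var/cqpweb/index/test_flxml_corpus
#   whole_corpus_region "$dir" whole > regions.txt
#   cut -f 1,2 regions.txt | cwb-s-encode -d "$dir" -S text
#   cwb-s-encode -d "$dir" -f regions.txt -V text_id
# The registry file then also needs "STRUCTURE text" and
# "STRUCTURE text_id" lines so that CWB and CQPweb can see the new attributes.
```

The two cwb-s-encode calls are kept separate because CQPweb asks for both a text attribute and a text_id attribute carrying the id values.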

The “how to index a corpus” chapter in the manual is incomplete: it does mention that <text> is compulsory but doesn’t spell out the implications. See chapter 6 here: http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf (the version in CQPwebInABox may or may not have the same chapter numbering; it’s a few months older).

best

Andrew.


From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 July 2016 14:10
To: Open source development of the Corpus WorkBench; CWBdev Mailing List
Subject: Re: [CWB] WebInABox: Can't import existing corpora from host

On Sat, Jul 23, 2016 at 3:19 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

Hi Andrew,

Might it be a permissions issue? Depending on how you mounted it, the Vbox shared folder containing the index data may not be accessible to the http daemon. Check with ls -l.

Please check this. If that’s not it, please post the HOME line of the registry in your reply, and I’ll use that to check the code.

Thanks, Andrew. It was indeed a permissions issue. In order to troubleshoot this (as symlinks can be tricky), I copied the index files and registry into the CQPWiaB VM and placed them into the same directories as the BNC sampler and Mandarin corpora. The problems persisted, so I changed permissions and ownership as follows (replace test_flxml_corpus with the name of your corpus):

cd /var/cqpweb/index
sudo chown www-data:www-data test_flxml_corpus
sudo chmod 755 test_flxml_corpus

cd test_flxml_corpus
sudo chown www-data:www-data *
sudo chmod 644 *

cd ../../registry/
sudo chown www-data:www-data test_flxml_corpus
sudo chmod 664 test_flxml_corpus
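
Before changing ownership, a quick way to spot likely culprits is to list anything that users outside the owner and group cannot read. The sketch below is only an approximation under stated assumptions: the helper name list_unreadable_by_others is made up for illustration, and it looks solely at the "other" permission bits, so it says nothing about files that the web server user can already reach through ownership or group membership.

```shell
#!/bin/sh
# Hypothetical helper: list entries under a directory that lack the "other"
# read bit, plus directories that lack the "other" search (execute) bit.
# $1 = directory to inspect (e.g. a CWB data directory).
list_unreadable_by_others() {
    dir=$1
    # -perm -004 is true when the other-read bit is set; -perm -001 when the
    # other-execute bit is set. The leading "!" negates each test.
    find "$dir" \( ! -perm -004 -o -type d ! -perm -001 \) -print
}

# Intended use (adjust the path to your corpus):
#   list_unreadable_by_others /var/cqpweb/index/test_flxml_corpus
```

An empty result means every file is world-readable and every directory is world-searchable; anything printed is a candidate for the chown and chmod steps above.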


So now I can attempt to import the corpus, but I run into a new error: "Pre-indexed corpora require s-attributes text and text_id!!".  I've searched the manual included in CQPWiaB but there's no mention of "text_id". What am I doing wrong?

Below is the content of my registry file, in case that helps.

Thanks!
Scott

##
## registry entry for corpus TEST_FLXML_CORPUS
##

# long descriptive name for the corpus
NAME "Test corpus using FreeLing XML tagger"
# corpus ID (must be lowercase in registry!)
ID test_flxml_corpus
# path to binary data files
HOME /var/cqpweb/index/test_flxml_corpus
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_flxml_corpus/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "es-CL" # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type


##
## s-attributes (structural markup)
##

# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s

# <id corpus=".." tagger=".." file=".." label=".." channel=".." audience=".." purpose=".." genre=".." field=".." area=".." source=".."> ... </id>
# (no recursive embedding allowed)
STRUCTURE id
STRUCTURE id_corpus            # [annotations]
STRUCTURE id_tagger            # [annotations]
STRUCTURE id_file              # [annotations]
STRUCTURE id_label             # [annotations]
STRUCTURE id_channel           # [annotations]
STRUCTURE id_audience          # [annotations]
STRUCTURE id_purpose           # [annotations]
STRUCTURE id_genre             # [annotations]
STRUCTURE id_field             # [annotations]
STRUCTURE id_area              # [annotations]
STRUCTURE id_source            # [annotations]


# Yours sincerely, the Encode tool.
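
A registry file like the one above can be checked for the two s-attributes named in the error message with a few lines of shell. This is a minimal sketch: the helper name has_text_attributes is hypothetical, and it only inspects STRUCTURE lines in the registry, on the assumption that this is where CQPweb looks; it does not verify that the corresponding data files exist.

```shell
#!/bin/sh
# Hypothetical helper: report whether a CWB registry file declares the
# s-attributes "text" and "text_id".
# $1 = path to the registry file. Returns 0 only if both are declared.
has_text_attributes() {
    registry=$1
    status=0
    for att in text text_id; do
        # Match lines such as "STRUCTURE text" or "STRUCTURE text_id  # ...",
        # but not longer names such as "STRUCTURE text_type".
        if grep -Eq "^STRUCTURE[[:space:]]+${att}([[:space:]]|\$)" "$registry"; then
            echo "found:   STRUCTURE $att"
        else
            echo "missing: STRUCTURE $att"
            status=1
        fi
    done
    return $status
}

# Intended use (adjust the path to your registry file):
#   has_text_attributes /var/cqpweb/registry/test_flxml_corpus
```

For the registry shown here, which declares only s and the id attributes, both lines would be reported as missing.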


From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 23 July 2016 20:07
To: Open source development of the Corpus WorkBench
Subject: [CWB] WebInABox: Can't import existing corpora from host

Hi all!

I'm using the brilliant CQP Web in a Box to try and install an already-indexed corpus. This corpus is located on my host machine, and I'm using VirtualBox's virtual folders to access it from within CQPWiaB. I've made a local copy of the registry file, placed it inside VirtualBox, and edited it to reflect the difference in paths between the host machine and the virtual machine, and everything seems to be where it should be (or point to where it should point). But when I go to CQPWiaB's "Install a corpus you have already indexed in CWB", enter the corpus's name and try to install it, I get one of two errors:

1. If I choose the option to look for the registry file in CQPweb's usual directory (which is where I've placed the modified registry file), it says: "A data-directory path could not be found in the registry file for the CWB corpus you specified. Either the data-directory is unspecified, or it is specified with a relative path (an absolute path is needed)".

I'm using an absolute path in the registry file (/var/cqpweb/index/test_flxml_corpus), and all the files appear there in my file manager.

2. If I choose the option to specify the location of the registry and enter the exact same directory that CQPweb uses as its default, but manually, I get this error: "A corpus by that name already exists in the CQPweb registry!".

I also get error 2 if I put in "/dev/null/" or garbage text ("asdfasdfas").
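
To rule out the cause that the first message names, the HOME entry can be pulled out of the registry file and tested directly. The sketch below assumes a hypothetical helper name, check_registry_home, and a HOME path without quotes or spaces; it checks only whether the path is absolute, not whether the web server can read it.

```shell
#!/bin/sh
# Hypothetical helper: print the HOME (data directory) entry of a CWB
# registry file and say whether it is an absolute path.
# $1 = path to the registry file. Returns 0 only for an absolute path.
check_registry_home() {
    registry=$1
    # Take the second field of the first line whose first field is "HOME".
    data_dir=$(awk '$1 == "HOME" { print $2; exit }' "$registry")
    case $data_dir in
        "")  echo "no HOME line found"; return 1 ;;
        /*)  echo "absolute: $data_dir" ;;
        *)   echo "relative: $data_dir"; return 1 ;;
    esac
}

# Intended use (adjust the path to your registry file):
#   check_registry_home /var/cqpweb/registry/test_flxml_corpus
```

If this reports an absolute path and the error persists, the registry file itself may simply not be readable by the process doing the import.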

Any idea what's going on?

Thanks,
Scott