[CWB] WebInABox: Can't import existing corpora from host

Scott Sadowsky ssadowsky at gmail.com
Sun Jul 24 15:09:30 CEST 2016


On Sat, Jul 23, 2016 at 3:19 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

Hi Andrew,

> Might it be a permissions issue? Depending on how you mounted it, the Vbox
> shared folder containing the index data may not be accessible to the http
> daemon. Check with ls -l.
>
>
>
> Please check this; if it’s not this, then please post the HOME line of the
> registry in your reply, and I’ll use that to check the code.
>

Thanks, Andrew. It was indeed a permissions issue. In order to troubleshoot
this (as symlinks can be tricky), I copied the index files and registry
into the CQPWiaB VM and placed them into the same directories as the BNC
sampler and Mandarin corpora. The problems persisted, so I changed
permissions and ownership as follows (replace test_flxml_corpus with the
name of your corpus):

cd /var/cqpweb/index
sudo chown www-data:www-data test_flxml_corpus
sudo chmod 755 test_flxml_corpus

cd test_flxml_corpus
sudo chown www-data:www-data *
sudo chmod 644 *

cd ../../registry/
sudo chown www-data:www-data test_flxml_corpus
sudo chmod 664 test_flxml_corpus
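
To confirm that the changes took effect, something like the following could be used. It is only a sketch: show_perms is a hypothetical helper name (not part of CQPweb or CWB), and it assumes GNU find and GNU stat, which I'd expect on a www-data based system like the VM but haven't verified.

```shell
#!/bin/sh
# Sketch only: show_perms is a hypothetical helper, not a CQPweb/CWB command.
# Prints "MODE OWNER:GROUP PATH" for a directory and its immediate contents.
# Assumes GNU find (-maxdepth) and GNU stat (-c format option).
show_perms() {
    find "$1" -maxdepth 1 -exec stat -c '%a %U:%G %n' {} +
}

# Example call, using the data directory from the commands above:
# show_perms /var/cqpweb/index/test_flxml_corpus
```

After the commands above, the directory itself should show up as 755 and the files inside it as 644, all owned by www-data:www-data.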


So now I can attempt to import the corpus, but I run into a new error:
"Pre-indexed corpora require s-attributes text and text_id!!".  I've
searched the manual included in CQPWiaB but there's no mention of
"text_id". What am I doing wrong?

Below is the content of my registry file, in case that helps.

Thanks!
Scott

##
## registry entry for corpus TEST_FLXML_CORPUS
##

# long descriptive name for the corpus
NAME "Test corpus using FreeLing XML tagger"
# corpus ID (must be lowercase in registry!)
ID test_flxml_corpus
# path to binary data files
HOME /var/cqpweb/index/test_flxml_corpus
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_flxml_corpus/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "es-CL" # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type


##
## s-attributes (structural markup)
##

# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s

# <id corpus=".." tagger=".." file=".." label=".." channel=".."
# audience=".." purpose=".." genre=".." field=".." area=".." source=".."> ...
# </id>
# (no recursive embedding allowed)
STRUCTURE id
STRUCTURE id_corpus            # [annotations]
STRUCTURE id_tagger            # [annotations]
STRUCTURE id_file              # [annotations]
STRUCTURE id_label             # [annotations]
STRUCTURE id_channel           # [annotations]
STRUCTURE id_audience          # [annotations]
STRUCTURE id_purpose           # [annotations]
STRUCTURE id_genre             # [annotations]
STRUCTURE id_field             # [annotations]
STRUCTURE id_area              # [annotations]
STRUCTURE id_source            # [annotations]


# Yours sincerely, the Encode tool.
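
For what it's worth, here is a rough way to check whether a registry file declares the two s-attributes named in the error message. It is a sketch under assumptions: has_sattr and check_required are hypothetical helper names, the check simply looks for "STRUCTURE <name>" lines, and it may not match how CQPweb itself performs the test.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not CWB or CQPweb commands.

# Succeeds if registry file $1 contains a line "STRUCTURE <name>" where
# <name> is exactly $2 (optionally followed by whitespace and a comment).
has_sattr() {
    grep -Eq "^STRUCTURE[[:space:]]+$2([[:space:]]|\$)" "$1"
}

# Reports, for registry file $1, whether the s-attributes named in the
# error message (text and text_id) are declared.
check_required() {
    for attr in text text_id; do
        if has_sattr "$1" "$attr"; then
            echo "$attr: declared"
        else
            echo "$attr: missing"
        fi
    done
}

# Example call, using the registry location from the commands above:
# check_required /var/cqpweb/registry/test_flxml_corpus
```

Run against the registry above, this reports both names as missing, since the only structures declared are s, id and the id_* annotations.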



> *From:* cwb-bounces at liste.sslmit.unibo.it *On Behalf Of* Scott Sadowsky
> *Sent:* 23 July 2016 20:07
> *To:* Open source development of the Corpus WorkBench
> *Subject:* [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> Hi all!
>
>
>
> I'm using the brilliant CQP Web in a Box to try and install an
> already-indexed corpus. This corpus is located on my host machine, and I'm
> using VirtualBox's virtual folders to access it from within CQPWiaB. I've
> made a local copy of the registry file, placed it inside VirtualBox, and
> edited it to reflect the difference in paths between the host machine and
> the virtual machine, and everything seems to be where it should be (or
> point to where it should point). But when I go to CQPWiaB's "Install a
> corpus you have already indexed in CWB", enter the corpus's name and try to
> install it, I get one of two errors:
>
>
>
> 1. If I choose the option to look for the registry file in CQPweb's usual
> directory (which is where I've placed the modified registry file), it says:
> "A data-directory path could not be found in the registry file for the CWB
> corpus you specified. Either the data-directory is unspecified, or it is
> specified with a relative path (an absolute path is needed)".
>
>
>
> I'm using an absolute path in the registry file
> (/var/cqpweb/index/test_flxml_corpus), and all the files appear there in my
> file manager.
>
>
>
> 2. If I choose the option to specify the location of the registry and
> enter the exact same directory that CQPweb uses as its default, but
> manually, I get this error: "A corpus by that name already exists in the
> CQPweb registry!".
>
>
>
> I also get error 2 if I put in "/dev/null/" or garbage text ("asdfasdfas").
>
>
>
> Any idea what's going on?
>
>
>
> Thanks,
> Scott
>