[CWB] WebInABox: Can't import existing corpora from host

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Jul 24 17:29:38 CEST 2016


First point – your text ID codes won’t work: they need to be handles, i.e. just ASCII letters, numbers, and underscores, with no hyphens or full stops.
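
A minimal sketch of one way to turn existing file names into handle-style IDs, assuming it is acceptable to replace every disallowed character with an underscore. The function name `to_handle` is hypothetical and is not part of CWB or CQPweb. Two different IDs could collide after replacement, so the results should be checked for uniqueness before re-encoding.

```python
import re

def to_handle(text_id):
    """Replace every character that is not an ASCII letter, digit or underscore with '_'.

    Hypothetical helper for illustration; not a CWB/CQPweb function.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)

# The text ID used later in this thread contains hyphens and full stops.
handle = to_handle("CCN-F2-02_D_StB.ortografica.txt")
print(handle)  # CCN_F2_02_D_StB_ortografica_txt
```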

Second point – the various s-attributes text_corpus , text_tagger etc. need (a) to exist in the registry – did your correction fix this? (b) CQPweb needs to have logged their existence – if it’s saying “No XML annotations found” that suggests it hasn’t, which could be a consequence of (a), or could be a bug.
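
For point (a), a quick way to see which s-attributes a registry file actually declares is to list its STRUCTURE lines. The following is only a sketch: the function name `s_attributes` and the list of wanted names are assumptions for illustration, and this is not how CQPweb itself reads the registry. The sample text mirrors the registry quoted further down in this thread, which declares `id` and `id_*` rather than `text` and `text_*`.

```python
def s_attributes(registry_text):
    """Return the names declared on STRUCTURE lines of a CWB registry file."""
    names = []
    for line in registry_text.splitlines():
        # Drop trailing comments such as "# [annotations]" before splitting.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "STRUCTURE":
            names.append(fields[1])
    return names

sample = """\
STRUCTURE s
STRUCTURE id
STRUCTURE id_corpus            # [annotations]
"""

found = s_attributes(sample)
# Names chosen for illustration from the messages in this thread.
wanted = ("text", "text_id", "text_corpus")
missing = [name for name in wanted if name not in found]
print("declared:", found)
print("missing:", missing)
```

To check a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.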

There was in fact a bug with s-attributes in the registry failing to be detected which I fixed a few months back: I cannot recall if that was before or after the version of the code in the VM image. If you want to rule this out, connect the VM’s networking, upgrade CQPweb to the latest version from SVN (don’t forget to do the database upgrade!), and try again: if that fixes it, it was the old bug.

Once CQPweb is aware of your XML attributes you should be able to use them to derive text metadata.

best

Andrew.

From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 July 2016 15:52
To: CWBdev Mailing List
Subject: [CWB] WebInABox: Can't import existing corpora from host

On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

CQPweb requires all corpora to have at least one <text> element, and every text element has to have an id i.e. everything within the corpus has to be contained within a sequence of one or more

<text id="somethinghere">
…
</text>

Thanks, Andrew. It turns out the problem was that I had been using the name "id" instead of "text" for the element. Now that I've changed that, I was able to successfully create the corpus in CQPweb.

My source files have quite a bit of metadata, which I've encoded as follows:

<text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml" language="spanish" location="concepcion" sex="f">
...
</text>

I'm now at the CQPweb "Design and insert a text-metadata table for the corpus" page, but it tells me that "No XML annotations found for this corpus". Is there something wrong with how I did the encoding above? I can use all of these XML elements in cqp searches directly, but here they aren't recognized.

(I've checked chapter 6 of the manual, to no avail).

Best wishes,
Scott



From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 July 2016 14:10
To: Open source development of the Corpus WorkBench; CWBdev Mailing List
Subject: Re: [CWB] WebInABox: Can't import existing corpora from host

On Sat, Jul 23, 2016 at 3:19 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

Hi Andrew,

Might it be a permissions issue? Depending on how you mounted it, the Vbox shared folder containing the index data may not be accessible to the http daemon. Check with ls -l.

Please check this; if that’s not it, then please post the HOME line of the registry in your reply, and I’ll use that to check the code.

Thanks, Andrew. It was indeed a permissions issue. In order to troubleshoot this (as symlinks can be tricky), I copied the index files and registry into the CQPWiaB VM and placed them into the same directories as the BNC sampler and Mandarin corpora. The problems persisted, so I changed permissions and ownership as follows (replace test_flxml_corpus with the name of your corpus):

cd /var/cqpweb/index
sudo chown www-data:www-data test_flxml_corpus
sudo chmod 755 test_flxml_corpus

cd test_flxml_corpus
sudo chown www-data:www-data *
sudo chmod 644 *

cd ../../registry/
sudo chown www-data:www-data test_flxml_corpus
sudo chmod 664 test_flxml_corpus
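
A rough way to spot remaining permission problems, sketched in Python. It assumes the http daemon reaches the files as an "other" user (neither owner nor group member), so it only inspects the world-read and world-execute bits; after the chown commands above the daemon is the owner, and the check is then stricter than necessary. The function name `not_world_readable` and the file name in the demonstration are stand-ins for illustration.

```python
import os
import stat
import tempfile

def not_world_readable(index_dir):
    """List paths under index_dir that 'other' users cannot read (or, for directories, enter)."""
    problems = []
    for root, _dirs, files in os.walk(index_dir):
        for path in [root] + [os.path.join(root, name) for name in files]:
            mode = os.stat(path).st_mode
            needed = stat.S_IROTH
            if stat.S_ISDIR(mode):
                needed |= stat.S_IXOTH  # directories also need the execute bit
            if mode & needed != needed:
                problems.append(path)
    return problems

# Demonstration on a throwaway directory rather than the real index directory.
with tempfile.TemporaryDirectory() as demo_dir:
    os.chmod(demo_dir, 0o755)
    data_file = os.path.join(demo_dir, "word.corpus")  # stand-in file name
    open(data_file, "w").close()
    os.chmod(data_file, 0o600)   # owner-only: should be reported
    before = not_world_readable(demo_dir)
    os.chmod(data_file, 0o644)   # same mode as in the commands above
    after = not_world_readable(demo_dir)

print(len(before), len(after))
```

On the VM, the same function could be pointed at /var/cqpweb/index/test_flxml_corpus.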


So now I can attempt to import the corpus, but I run into a new error: "Pre-indexed corpora require s-attributes text and text_id!!".  I've searched the manual included in CQPWiaB but there's no mention of "text_id". What am I doing wrong?

Below is the content of my registry file, in case that helps.

Thanks!
Scott

##
## registry entry for corpus TEST_FLXML_CORPUS
##

# long descriptive name for the corpus
NAME "Test corpus using FreeLing XML tagger"
# corpus ID (must be lowercase in registry!)
ID test_flxml_corpus
# path to binary data files
HOME /var/cqpweb/index/test_flxml_corpus
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_flxml_corpus/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "es-CL" # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type


##
## s-attributes (structural markup)
##

# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s

# <id corpus=".." tagger=".." file=".." label=".." channel=".." audience=".." purpose=".." genre=".." field=".." area=".." source=".."> ... </id>
# (no recursive embedding allowed)
STRUCTURE id
STRUCTURE id_corpus            # [annotations]
STRUCTURE id_tagger            # [annotations]
STRUCTURE id_file              # [annotations]
STRUCTURE id_label             # [annotations]
STRUCTURE id_channel           # [annotations]
STRUCTURE id_audience          # [annotations]
STRUCTURE id_purpose           # [annotations]
STRUCTURE id_genre             # [annotations]
STRUCTURE id_field             # [annotations]
STRUCTURE id_area              # [annotations]
STRUCTURE id_source            # [annotations]


# Yours sincerely, the Encode tool.


From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 23 July 2016 20:07
To: Open source development of the Corpus WorkBench
Subject: [CWB] WebInABox: Can't import existing corpora from host

Hi all!

I'm using the brilliant CQP Web in a Box to try and install an already-indexed corpus. This corpus is located on my host machine, and I'm using VirtualBox's virtual folders to access it from within CQPWiaB. I've made a local copy of the registry file, placed it inside VirtualBox, and edited it to reflect the difference in paths between the host machine and the virtual machine, and everything seems to be where it should be (or point to where it should point). But when I go to CQPWiaB's "Install a corpus you have already indexed in CWB", enter the corpus's name and try to install it, I get one of two errors:

1. If I choose the option to look for the registry file in CQPweb's usual directory (which is where I've placed the modified registry file), it says: "A data-directory path could not be found in the registry file for the CWB corpus you specified. Either the data-directory is unspecified, or it is specified with a relative path (an absolute path is needed)".

I'm using an absolute path in the registry file (/var/cqpweb/index/test_flxml_corpus), and all the files appear there in my file manager.

2. If I choose the option to specify the location of the registry and enter the exact same directory that CQPweb uses as its default, but manually, I get this error: "A corpus by that name already exists in the CQPweb registry!".

I also get error 2 if I put in "/dev/null/" or garbage text ("asdfasdfas").
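
For error 1, one independent sanity check is to pull the HOME line out of the registry file and confirm that it is an absolute path. This is only a sketch under the assumption that the data directory is given on a single line starting with HOME, as in the registry quoted in this thread; the function name `home_path` is hypothetical and this is not CQPweb's own parsing code.

```python
import posixpath

def home_path(registry_text):
    """Return the value of the first HOME line, or None if there is none."""
    for line in registry_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "HOME":
            # Strip optional surrounding quotes around the path.
            return fields[1].strip().strip('"')
    return None

sample = "ID test_flxml_corpus\nHOME /var/cqpweb/index/test_flxml_corpus\n"
home = home_path(sample)
is_absolute = home is not None and posixpath.isabs(home)
print(home, is_absolute)
```

If this reports an absolute path and the error persists, the cause probably lies elsewhere, for example in whether the web server can read the registry file at all.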

Any idea what's going on?

Thanks,
Scott

--
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/
