[CWB] WebInABox: Can't import existing corpora from host

Scott Sadowsky ssadowsky at gmail.com
Sun Jul 24 16:51:53 CEST 2016


On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> CQPweb requires all corpora to have at least one <text> element, and every
> text element has to have an id, i.e. everything within the corpus has to be
> contained within a sequence of one or more
>
> <text id="somethinghere">
> ...
> </text>
>

Thanks, Andrew. It turns out the problem was that I had been using the name
"id" instead of "text" for the element. Now that I've changed that, I was
able to successfully create the corpus in CQPweb.
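
For anyone hitting the same "Pre-indexed corpora require s-attributes text and text_id" error, here is a minimal Python sketch of a pre-flight check on the registry file. It is only an illustration and not CQPweb's own code: the function names are made up for this example, and it assumes the registry declares s-attributes on lines beginning with STRUCTURE, as in the registry file quoted further down.

```python
import re

# Names taken from the CQPweb error message quoted in this thread.
REQUIRED = {"text", "text_id"}


def structure_names(registry_text):
    """Return the set of s-attribute names declared on STRUCTURE lines."""
    names = set()
    for line in registry_text.splitlines():
        match = re.match(r"\s*STRUCTURE\s+(\S+)", line)
        if match:
            names.add(match.group(1))
    return names


def missing_required(registry_text):
    """Return the required s-attributes that the registry does not declare."""
    return sorted(REQUIRED - structure_names(registry_text))


# Registry fragment using <id ...> as the text element (the original mistake).
old_registry = """
STRUCTURE s
STRUCTURE id
STRUCTURE id_corpus            # [annotations]
"""

# Registry fragment after renaming the element to <text id="...">.
new_registry = """
STRUCTURE s
STRUCTURE text
STRUCTURE text_id              # [annotations]
"""

print(missing_required(old_registry))  # ['text', 'text_id']
print(missing_required(new_registry))  # []
```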

My source files have quite a bit of metadata, which I've encoded as follows:

<text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml"
language="spanish" location="concepcion" sex="f">
...
</text>

I'm now at the CQPweb "Design and insert a text-metadata table for the
corpus" page, but it tells me that "No XML annotations found for this
corpus". Is there something wrong with how I did the encoding above? I can
use all of these XML elements in cqp searches directly, but here they
aren't recognized.

(I've checked chapter 6 of the manual, to no avail).
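
As a cross-check on the encoding, the registry quoted further down suggests that each XML attribute ends up as an s-attribute named element_attribute (id_corpus, id_tagger and so on). The sketch below, which assumes that naming pattern and uses a made-up helper name, lists the s-attributes one would expect to find in the registry for the tag above. It does not explain the CQPweb message; it only shows what to look for.

```python
import re


def expected_s_attributes(start_tag):
    """List the s-attribute names implied by one XML start tag."""
    element = re.match(r"<\s*([\w.-]+)", start_tag)
    if element is None:
        raise ValueError("not an XML start tag: %r" % start_tag)
    name = element.group(1)
    # Attribute names are whatever precedes ="..." inside the tag.
    attrs = re.findall(r'([\w.-]+)\s*=\s*"[^"]*"', start_tag)
    return [name] + ["%s_%s" % (name, attr) for attr in attrs]


tag = ('<text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" '
       'tagger="freeling-xml" language="spanish" location="concepcion" sex="f">')

for s_attribute in expected_s_attributes(tag):
    print(s_attribute)
```

Comparing this list with the STRUCTURE lines of the re-encoded registry file should show whether text_id and the other metadata attributes were actually indexed.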

Best wishes,
Scott



> *From:* cwb-bounces at liste.sslmit.unibo.it
> [mailto:cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 14:10
> *To:* Open source development of the Corpus WorkBench; CWBdev Mailing List
> *Subject:* Re: [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sat, Jul 23, 2016 at 3:19 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> Hi Andrew,
>
>
>
> Might it be a permissions issue? Depending on how you mounted it, the Vbox
> shared folder containing the index data may not be accessible to the http
> daemon. Check with ls -l.
>
>
>
> Please check this; if that's not it, then please post the HOME line of the
> registry in your reply, and I'll use that to check the code.
>
>
>
> Thanks, Andrew. It was indeed a permissions issue. In order to
> troubleshoot this (as symlinks can be tricky), I copied the index files and
> registry into the CQPWiaB VM and placed them into the same directories as
> the BNC sampler and Mandarin corpora. The problems persisted, so I changed
> permissions and ownership as follows (replace test_flxml_corpus with the
> name of your corpus):
>
>
>
> cd /var/cqpweb/index
> sudo chown www-data:www-data test_flxml_corpus
> sudo chmod 755 test_flxml_corpus
>
> cd test_flxml_corpus
> sudo chown www-data:www-data *
> sudo chmod 644 *
>
> cd ../../registry/
> sudo chown www-data:www-data test_flxml_corpus
> sudo chmod 664 test_flxml_corpus
>
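
A rough Python equivalent of checking with ls -l, in case it is useful for scripting: it reports the permission bits of a data directory and its files, and flags anything that users other than the owner cannot read. This is only a sketch; the function names are invented for the example, the demo runs on a throwaway temporary directory with a placeholder file name, and checking the "other" bits is a simplification that matters mainly when the files are not owned by the web server user.

```python
import os
import stat
import tempfile


def permission_report(data_dir):
    """Return (path, octal mode) pairs for data_dir and the entries directly inside it."""
    paths = [data_dir] + [os.path.join(data_dir, name)
                          for name in sorted(os.listdir(data_dir))]
    return [(path, format(stat.S_IMODE(os.stat(path).st_mode), "03o"))
            for path in paths]


def lacking_other_read(data_dir):
    """Return paths that 'other' users cannot read (or, for directories, enter)."""
    problems = []
    for path, _ in permission_report(data_dir):
        mode = os.stat(path).st_mode
        needed = stat.S_IROTH
        if stat.S_ISDIR(mode):
            needed |= stat.S_IXOTH
        if mode & needed != needed:
            problems.append(path)
    return problems


# Demo on a temporary directory; "example.dat" is a placeholder file name.
with tempfile.TemporaryDirectory() as demo_dir:
    data_file = os.path.join(demo_dir, "example.dat")
    open(data_file, "w").close()
    os.chmod(demo_dir, 0o755)
    os.chmod(data_file, 0o600)  # deliberately too strict for the demo
    for path, mode in permission_report(demo_dir):
        print(mode, path)
    print("not readable by others:", lacking_other_read(demo_dir))
```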
>
>
>
>
> So now I can attempt to import the corpus, but I run into a new error:
> "Pre-indexed corpora require s-attributes text and text_id!!".  I've
> searched the manual included in CQPWiaB but there's no mention of
> "text_id". What am I doing wrong?
>
>
>
> Below is the content of my registry file, in case that helps.
>
>
>
> Thanks!
>
> Scott
>
>
>
> ##
> ## registry entry for corpus TEST_FLXML_CORPUS
> ##
>
> # long descriptive name for the corpus
> NAME "Test corpus using FreeLing XML tagger"
> # corpus ID (must be lowercase in registry!)
> ID test_flxml_corpus
> # path to binary data files
> HOME /var/cqpweb/index/test_flxml_corpus
> # optional info file (displayed by "info;" command in CQP)
> INFO /var/cqpweb/index/test_flxml_corpus/.info
>
> # corpus properties provide additional information about the corpus:
> ##:: charset  = "utf8" # character encoding of corpus data
> ##:: language = "es-CL" # insert ISO code for language (de, en, fr, ...)
>
>
> ##
> ## p-attributes (token annotations)
> ##
>
> ATTRIBUTE word
> ATTRIBUTE lemma
> ATTRIBUTE tag
> ATTRIBUTE ctag
> ATTRIBUTE pos
> ATTRIBUTE type
>
>
> ##
> ## s-attributes (structural markup)
> ##
>
> # <s> ... </s>
> # (no recursive embedding allowed)
> STRUCTURE s
>
> # <id corpus=".." tagger=".." file=".." label=".." channel=".." audience=".." purpose=".." genre=".." field=".." area=".." source=".."> ... </id>
> # (no recursive embedding allowed)
> STRUCTURE id
> STRUCTURE id_corpus            # [annotations]
> STRUCTURE id_tagger            # [annotations]
> STRUCTURE id_file              # [annotations]
> STRUCTURE id_label             # [annotations]
> STRUCTURE id_channel           # [annotations]
> STRUCTURE id_audience          # [annotations]
> STRUCTURE id_purpose           # [annotations]
> STRUCTURE id_genre             # [annotations]
> STRUCTURE id_field             # [annotations]
> STRUCTURE id_area              # [annotations]
> STRUCTURE id_source            # [annotations]
>
>
> # Yours sincerely, the Encode tool.
>
>
>
>
>
> *From:* cwb-bounces at liste.sslmit.unibo.it
> [mailto:cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 23 July 2016 20:07
> *To:* Open source development of the Corpus WorkBench
> *Subject:* [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> Hi all!
>
>
>
> I'm using the brilliant CQP Web in a Box to try and install an
> already-indexed corpus. This corpus is located on my host machine, and I'm
> using VirtualBox's virtual folders to access it from within CQPWiaB. I've
> made a local copy of the registry file, placed it inside VirtualBox, and
> edited it to reflect the difference in paths between the host machine and
> the virtual machine, and everything seems to be where it should be (or
> point to where it should point). But when I go to CQPWiaB's "Install a
> corpus you have already indexed in CWB", enter the corpus's name and try to
> install it, I get one of two errors:
>
>
>
> 1. If I choose the option to look for the registry file in CQPweb's usual
> directory (which is where I've placed the modified registry file), it says:
> "A data-directory path could not be found in the registry file for the CWB
> corpus you specified. Either the data-directory is unspecified, or it is
> specified with a relative path (an absolute path is needed)".
>
>
>
> I'm using an absolute path in the registry file
> (/var/cqpweb/index/test_flxml_corpus), and all the files appear there in my
> file manager.
>
>
>
> 2. If I choose the option to specify the location of the registry and
> enter the exact same directory that CQPweb uses as its default, but
> manually, I get this error: "A corpus by that name already exists in the
> CQPweb registry!".
>
>
>
> I also get error 2 if I put in "/dev/null/" or garbage text ("asdfasdfas").
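
One way to rule out the registry file itself as the cause of error 1 is to confirm that it contains a HOME line with an absolute path. The Python sketch below does only that; it is not CQPweb's actual check, the function names are invented for the example, and it assumes the HOME line has the simple form shown in the registry file quoted in this thread.

```python
import posixpath
import re


def home_path(registry_text):
    """Return the value of the first HOME line, or None if there is none."""
    for line in registry_text.splitlines():
        match = re.match(r"\s*HOME\s+(.+?)\s*$", line)
        if match:
            return match.group(1).strip('"')
    return None


def home_problem(registry_text):
    """Describe what is wrong with the HOME line, or return None if it looks fine."""
    home = home_path(registry_text)
    if home is None:
        return "no HOME line"
    if not posixpath.isabs(home):
        return "HOME is a relative path"
    return None


registry = """
ID test_flxml_corpus
HOME /var/cqpweb/index/test_flxml_corpus
"""

print(home_path(registry))     # /var/cqpweb/index/test_flxml_corpus
print(home_problem(registry))  # None
```

If this reports no problem, the registry file is probably fine and the cause lies elsewhere, for example in whether the web server can read the file at all.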
>
>
>
> Any idea what's going on?
>
>
>
> Thanks,
> Scott
>
>






-- 
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/