[CWB] WebInABox: Can't import existing corpora from host

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Jul 25 11:48:06 CEST 2016


Try running

          select * from xml_metadata;

in the MySQL command line client, and see what you get.

best

Andrew.



From: cwb-bounces at liste.sslmit.unibo.it On Behalf Of Scott Sadowsky
Sent: 24 July 2016 17:17
To: Open source development of the Corpus WorkBench
Cc: Open source development of the Corpus WorkBench
Subject: Re: [CWB] WebInABox: Can't import existing corpora from host

On Sun, Jul 24, 2016 at 11:29 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

First point: your text ID codes won’t work. They need to be handles, i.e. just ASCII letters, numbers, and underscores, with no hyphens or full stops.

Now corrected!
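
For anyone hitting the same problem, the handle rule quoted above can be checked before indexing. The following is only a minimal sketch: the function names is_handle and to_handle are hypothetical helpers, not part of CWB or CQPweb, and replacing disallowed characters with underscores is just one possible clean-up policy.

```python
import re

# Handle rule as stated in the thread: ASCII letters, digits and underscore only.
# An explicit character class is used because \w would also accept non-ASCII letters.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_handle(text_id):
    """Return True if text_id consists only of ASCII letters, digits and underscores."""
    return HANDLE_RE.fullmatch(text_id) is not None


def to_handle(text_id):
    """Hypothetical clean-up: replace every disallowed character with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)
```

Note that this kind of replacement can map two different IDs onto the same handle, so the cleaned IDs should still be checked for uniqueness.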

Second point: the various s-attributes (text_corpus, text_tagger, etc.) need (a) to exist in the registry (did your correction fix this?), and (b) to have been logged by CQPweb. If it’s saying “No XML annotations found”, that suggests CQPweb hasn’t logged them, which could be a consequence of (a), or could be a bug.

Unless I'm mistaken about what attributes are what, they are indeed in the registry. I've pasted it at the end of this e-mail, along with a single tagged source text sentence.

There was in fact a bug where s-attributes in the registry failed to be detected, which I fixed a few months back; I cannot recall whether that was before or after the version of the code in the VM image. If you want to rule this out, connect the VM’s networking, upgrade CQPweb to the latest version from SVN (don’t forget to do the database upgrade!), and try again: if that fixes it, it was the old bug.

I've been using revision 879 (3.2.20) the whole time, so it shouldn't be the old bug.


Once CQPweb is aware of your XML attributes you should be able to use them to derive text metadata.

Thanks for your patience!

Cheers,
Scott


<text id="CCN_F2_25_Ca" corpus="test_two" tagger="freeling_xml" language="spanish" channel="oral" instrument="interview" lingualism="monolingual" location="concepcion" sex="f" generation="G2" sel="Ca">
<s>
¿       ¿       Fia     Fia     punctuation     questionmark
todavía todavía RG      RG      adverb  general
está    estar   VAIP3S0 VAI     verb    auxiliary
grabando        grabar  VMG0000 VMG     verb    main
?       ?       Fit     Fit     punctuation     questionmark
</s>
</text>



##
## registry entry for corpus TEST_TWO
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   test_two
# path to binary data files
HOME /var/cqpweb/index/test_two
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_two/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "es"     # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type


##
## s-attributes (structural markup)
##

# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s

# <text id=".." corpus=".." tagger=".." file=".." language=".." channel=".." instrument=".." lingualism=".." location=".." sex=".." generation=".." sel=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id              # [annotations]
STRUCTURE text_corpus          # [annotations]
STRUCTURE text_tagger          # [annotations]
STRUCTURE text_file            # [annotations]
STRUCTURE text_language        # [annotations]
STRUCTURE text_channel         # [annotations]
STRUCTURE text_instrument      # [annotations]
STRUCTURE text_lingualism      # [annotations]
STRUCTURE text_location        # [annotations]
STRUCTURE text_sex             # [annotations]
STRUCTURE text_generation      # [annotations]
STRUCTURE text_sel             # [annotations]


# Yours sincerely, the Encode tool.
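
A quick way to confirm which s-attributes a registry entry like the one above declares is to collect its STRUCTURE lines. This is only a sketch: s_attributes is a hypothetical helper, it takes the registry file's contents as a string (how the file is located and read is left to the reader), and it only checks the registry, not whether CQPweb has logged the attributes.

```python
def s_attributes(registry_text):
    """Return the names declared on STRUCTURE lines, in order of appearance."""
    names = []
    for line in registry_text.splitlines():
        # Drop comments (everything from the first '#') and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 2 and parts[0] == "STRUCTURE":
            names.append(parts[1])
    return names
```

For the entry above, the result should include text_id, text_corpus, text_tagger and the other text_* attributes alongside s and text.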



From: cwb-bounces at liste.sslmit.unibo.it On Behalf Of Scott Sadowsky
Sent: 24 July 2016 15:52
To: CWBdev Mailing List

Subject: [CWB] WebInABox: Can't import existing corpora from host

On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

CQPweb requires all corpora to have at least one <text> element, and every <text> element has to have an id; i.e. everything within the corpus has to be contained within a sequence of one or more elements of this form:

<text id="somethinghere">
…
</text>
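
The requirement above can be checked on the input before encoding. The sketch below assumes the input is in the usual one-token-per-line format with each XML tag on its own line and attribute values in straight double quotes; check_text_ids is a hypothetical helper, not a CWB or CQPweb function.

```python
import re

# An opening <text> tag, optionally followed by attributes.
TEXT_OPEN = re.compile(r"<text(\s[^>]*)?>")
# An id="..." attribute with a non-empty value, preceded by whitespace.
ID_ATTR = re.compile(r'\sid="([^"]+)"')


def check_text_ids(lines):
    """Return (number of <text> tags, 1-based line numbers of <text> tags lacking an id)."""
    count = 0
    missing = []
    for n, line in enumerate(lines, start=1):
        m = TEXT_OPEN.fullmatch(line.strip())
        if m is None:
            continue  # token line, closing tag, or some other element
        count += 1
        attrs = m.group(1) or ""
        if ID_ATTR.search(attrs) is None:
            missing.append(n)
    return count, missing
```

A count of zero, or a non-empty list of line numbers, would indicate input that does not meet the requirement as described.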

Thanks, Andrew. It turns out the problem was that I had been using the name "id" instead of "text" for the element. Now that I've changed that, I was able to successfully create the corpus in CQPweb.

My source files have quite a bit of metadata, which I've encoded as follows:

<text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml" language="spanish" location="concepcion" sex="f">
...
</text>

I'm now at the CQPweb "Design and insert a text-metadata table for the corpus" page, but it tells me that "No XML annotations found for this corpus". Is there something wrong with how I did the encoding above? I can use all of these XML elements in cqp searches directly, but here they aren't recognized.

(I've checked chapter 6 of the manual, to no avail).

Best wishes,
Scott
