[CWB] WebInABox: Can't import existing corpora from host

Scott Sadowsky ssadowsky at gmail.com
Sun Jul 24 18:16:44 CEST 2016


On Sun, Jul 24, 2016 at 11:29 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> First point – your text ID codes won’t work, they need to be *handles*,
> i.e. just ASCII letters, numbers, and underscore – no hyphens/full stops.
>

Now corrected!
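
For anyone hitting the same problem, here is a minimal sketch of how the IDs could be checked and converted, assuming the rule is exactly as Andrew states it (ASCII letters, digits and underscore only). The names `is_handle` and `to_handle` are hypothetical helpers, not part of CWB or CQPweb.

```python
import re

# A "handle" as described above: ASCII letters, digits and underscore only.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_handle(text_id: str) -> bool:
    """Return True if text_id consists solely of handle characters."""
    return HANDLE_RE.fullmatch(text_id) is not None


def to_handle(text_id: str) -> str:
    """Replace every character outside [A-Za-z0-9_] with an underscore.

    Hypothetical helper: two different IDs can collapse to the same
    handle, so the results should be checked for duplicates.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)


if __name__ == "__main__":
    old_id = "CCN-F2-02_D_StB.ortografica.txt"
    print(is_handle(old_id), to_handle(old_id))
```

Replacing with an underscore is only one possible choice; dropping the offending characters would work too, as long as the IDs stay unique.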


> Second point – the various s-attributes text_corpus , text_tagger etc.
> need (a) to exist in the registry – did your correction fix this? (b)
> CQPweb needs to have logged their existence – if it’s saying “No XML
> annotations found” that suggests it hasn’t, which could be a consequence of
> (a), or could be a bug.
>

Unless I'm mistaken about what attributes are what, they are indeed in the
registry. I've pasted it at the end of this e-mail, along with a single
tagged source text sentence.
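
A rough way to double-check point (a) without reading the registry by eye: the sketch below collects the ATTRIBUTE and STRUCTURE declarations from the text of a registry file and reports which expected `text_*` annotations are missing. The function names are hypothetical, and it only looks at the two keywords that appear in my registry file.

```python
def registry_attributes(registry_text: str) -> dict:
    """Collect p-attribute and s-attribute names declared in a CWB registry file."""
    found = {"ATTRIBUTE": [], "STRUCTURE": []}
    for line in registry_text.splitlines():
        # Drop trailing comments such as "# [annotations]", then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] in found:
            found[fields[0]].append(fields[1])
    return found


def missing_text_annotations(registry_text: str, expected: list) -> list:
    """Return the names in `expected` that have no matching text_<name> s-attribute."""
    declared = set(registry_attributes(registry_text)["STRUCTURE"])
    return [name for name in expected if f"text_{name}" not in declared]


if __name__ == "__main__":
    # Small inline sample; in practice the text would be read from the registry file.
    sample = "ATTRIBUTE word\nSTRUCTURE text\nSTRUCTURE text_id   # [annotations]\n"
    print(registry_attributes(sample))
    print(missing_text_annotations(sample, ["id", "corpus"]))
```

This only confirms that the attributes are declared in the registry; whether CQPweb has logged them is a separate question.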


> There was in fact a bug with s-attributes in the registry failing to be
> detected which I fixed a few months back: I cannot recall if that was
> before or after the version of the code in the VM image. If you want to
> rule this out, connect the VM’s networking, upgrade CQPweb to the latest
> version from SVN (don’t forget to do the database upgrade!), and try again:
> if that fixes it, it was the old bug.
>

I've been using revision 879 (3.2.20) the whole time, so it shouldn't be
the old bug.



> Once CQPweb is aware of your XML attributes you should be able to use them
> to derive text metadata.
>

Thanks for your patience!

Cheers,
Scott


<text id="CCN_F2_25_Ca" corpus="test_two" tagger="freeling_xml"
language="spanish" channel="oral" instrument="interview"
lingualism="monolingual" location="concepcion" sex="f" generation="G2"
sel="Ca">
<s>
¿ ¿ Fia Fia punctuation questionmark
todavía todavía RG RG adverb general
está estar VAIP3S0 VAI verb auxiliary
grabando grabar VMG0000 VMG verb main
? ? Fit Fit punctuation questionmark
</s>
</text>
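
The sentence above has six columns per token, matching the six p-attributes in the registry below. A minimal sketch for checking that across a whole file, assuming the real input is tab-delimited with each XML tag on a single line (the tabs and the one-line `<text>` tag do not survive in this e-mail); `check_token_columns` is a hypothetical helper:

```python
# Column names taken from the ATTRIBUTE lines of the registry entry.
P_ATTRIBUTES = ["word", "lemma", "tag", "ctag", "pos", "type"]


def check_token_columns(lines, n_columns=len(P_ATTRIBUTES)):
    """Return (line_number, column_count) for token lines with the wrong column count."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        # Blank lines and XML tags (<text ...>, <s>, </s>) are not token lines.
        # Caveat: a token line whose word starts with "<" would be skipped too.
        if not stripped.strip() or stripped.lstrip().startswith("<"):
            continue
        count = len(stripped.split("\t"))
        if count != n_columns:
            problems.append((number, count))
    return problems


if __name__ == "__main__":
    demo = ["<s>", "¿\t¿\tFia\tFia\tpunctuation\tquestionmark", "</s>"]
    print(check_token_columns(demo))
```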



##
## registry entry for corpus TEST_TWO
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   test_two
# path to binary data files
HOME /var/cqpweb/index/test_two
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_two/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "es"     # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type


##
## s-attributes (structural markup)
##

# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s

# <text id=".." corpus=".." tagger=".." file=".." language=".."
# channel=".." instrument=".." lingualism=".." location=".." sex=".."
# generation=".." sel=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id              # [annotations]
STRUCTURE text_corpus          # [annotations]
STRUCTURE text_tagger          # [annotations]
STRUCTURE text_file            # [annotations]
STRUCTURE text_language        # [annotations]
STRUCTURE text_channel         # [annotations]
STRUCTURE text_instrument      # [annotations]
STRUCTURE text_lingualism      # [annotations]
STRUCTURE text_location        # [annotations]
STRUCTURE text_sex             # [annotations]
STRUCTURE text_generation      # [annotations]
STRUCTURE text_sel             # [annotations]


# Yours sincerely, the Encode tool.




> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 15:52
> *To:* CWBdev Mailing List
>
> *Subject:* [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> CQPweb requires all corpora to have at least one <text> element, and every
> text element has to have an id i.e. everything within the corpus has to be
> contained within a sequence of one or more
>
>
>
> <text id="somethinghere">
>
>>
> </text>
>
>
>
> Thanks, Andrew. It turns out the problem was that I had been using the
> name "id" instead of "text" for the element. Now that I've changed that, I
> was able to successfully create the corpus in CQPweb.
>
>
>
> My source files have quite a bit of metadata, which I've encoded as
> follows:
>
>
>
> <text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml"
> language="spanish" location="concepcion" sex="f">
>
> ...
>
> </text>
>
>
> I'm now at the CQPweb "Design and insert a text-metadata table for the
> corpus" page, but it tells me that "No XML annotations found for this
> corpus". Is there something wrong with how I did the encoding above? I can
> use all of these XML elements in cqp searches directly, but here they
> aren't recognized.
>
>
>
> (I've checked chapter 6 of the manual, to no avail).
>
>
>
> Best wishes,
>
> Scott
>
>
>