[CWB] WebInABox: Can't import existing corpora from host
Scott Sadowsky
ssadowsky at gmail.com
Sun Jul 24 18:16:44 CEST 2016
On Sun, Jul 24, 2016 at 11:29 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:
> First point – your text ID codes won’t work, they need to be *handles*,
> i.e. just ASCII letters, numbers, and underscore – no hyphens/full stops.
>
Now corrected!
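
For anyone hitting the same problem, here is a minimal sketch of how text IDs can be checked against the handle rule quoted above (ASCII letters, numbers, and underscore only). The function names `is_handle` and `to_handle` are hypothetical helpers made up for this illustration, not part of CWB or CQPweb, and CQPweb may well impose further restrictions that this check does not cover.

```python
import re

# Assumption: a "handle" is exactly what the quoted message describes,
# i.e. one or more ASCII letters, digits or underscores.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_handle(text_id):
    """Return True if text_id consists only of ASCII letters, digits and underscores."""
    return HANDLE_RE.fullmatch(text_id) is not None


def to_handle(text_id):
    """Replace every character outside the handle set with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)


if __name__ == "__main__":
    for text_id in ("CCN-F2-02_D_StB.ortografica.txt", "CCN_F2_25_Ca"):
        print(text_id, is_handle(text_id), to_handle(text_id))
```

Note that blindly replacing characters can make two different IDs collide, so the converted IDs should be checked for uniqueness before re-encoding.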
> Second point – the various s-attributes text_corpus , text_tagger etc.
> need (a) to exist in the registry – did your correction fix this? (b)
> CQPweb needs to have logged their existence – if it’s saying “No XML
> annotations found” that suggests it hasn’t, which could be a consequence of
> (a), or could be a bug.
>
Unless I'm mistaken about what attributes are what, they are indeed in the
registry. I've pasted it at the end of this e-mail, along with a single
tagged source text sentence.
> There was in fact a bug with s-attributes in the registry failing to be
> detected which I fixed a few months back: I cannot recall if that was
> before or after the version of the code in the VM image. If you want to
> rule this out, connect the VM’s networking, upgrade CQPweb to the latest
> version from SVN (don’t forget to do the database upgrade!), and try again:
> if that fixes it, it was the old bug.
>
I've been using revision 879 (3.2.20) the whole time, so it shouldn't be
the old bug.
> Once CQPweb is aware of your XML attributes you should be able to use them
> to derive text metadata.
>
Thanks for your patience!
Cheers,
Scott
<text id="CCN_F2_25_Ca" corpus="test_two" tagger="freeling_xml"
language="spanish" channel="oral" instrument="interview"
lingualism="monolingual" location="concepcion" sex="f" generation="G2"
sel="Ca">
<s>
¿ ¿ Fia Fia punctuation questionmark
todavía todavía RG RG adverb general
está estar VAIP3S0 VAI verb auxiliary
grabando grabar VMG0000 VMG verb main
? ? Fit Fit punctuation questionmark
</s>
</text>
##
## registry entry for corpus TEST_TWO
##
# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID test_two
# path to binary data files
HOME /var/cqpweb/index/test_two
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_two/.info
# corpus properties provide additional information about the corpus:
##:: charset = "utf8" # character encoding of corpus data
##:: language = "es" # insert ISO code for language (de, en, fr, ...)
##
## p-attributes (token annotations)
##
ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type
##
## s-attributes (structural markup)
##
# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s
# <text id=".." corpus=".." tagger=".." file=".." language=".."
# channel=".." instrument=".." lingualism=".." location=".." sex=".."
# generation=".." sel=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id # [annotations]
STRUCTURE text_corpus # [annotations]
STRUCTURE text_tagger # [annotations]
STRUCTURE text_file # [annotations]
STRUCTURE text_language # [annotations]
STRUCTURE text_channel # [annotations]
STRUCTURE text_instrument # [annotations]
STRUCTURE text_lingualism # [annotations]
STRUCTURE text_location # [annotations]
STRUCTURE text_sex # [annotations]
STRUCTURE text_generation # [annotations]
STRUCTURE text_sel # [annotations]
# Yours sincerely, the Encode tool.
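
To double-check point (a) above, whether the s-attributes are actually declared in the registry, a short script can list the `STRUCTURE` lines of a registry file like the one pasted here. This is only a sketch: `registry_s_attributes` is a hypothetical helper written for this message, it treats everything after a `#` as a comment, and it says nothing about whether CQPweb itself has logged the attributes (point (b)).

```python
def registry_s_attributes(registry_text):
    """Return the names declared on STRUCTURE lines, in file order."""
    names = []
    for line in registry_text.splitlines():
        # Drop trailing comments such as "# [annotations]".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "STRUCTURE":
            names.append(fields[1])
    return names


if __name__ == "__main__":
    sample = """\
ATTRIBUTE word
# <s> ... </s>
STRUCTURE s
STRUCTURE text
STRUCTURE text_id # [annotations]
STRUCTURE text_sex # [annotations]
"""
    for name in registry_s_attributes(sample):
        print(name)
```

Run against the registry above, this should list `s`, `text`, and the twelve `text_*` attributes, which is what I would expect CQPweb to offer on the metadata page.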
> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 15:52
> *To:* CWBdev Mailing List
>
> *Subject:* [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> CQPweb requires all corpora to have at least one <text> element, and every
> text element has to have an id i.e. everything within the corpus has to be
> contained within a sequence of one or more
>
>
>
> <text id=”somethinghere”>
>
> …
>
> </text>
>
>
>
> Thanks, Andrew. It turns out the problem was that I had been using the
> name "id" instead of "text" for the element. Now that I've changed that, I
> was able to successfully create the corpus in CQPweb.
>
>
>
> My source files have quite a bit of metadata, which I've encoded as
> follows:
>
>
>
> <text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml"
> language="spanish" location="concepcion" sex="f">
>
> ...
>
> </text>
>
>
> I'm now at the CQPweb "Design and insert a text-metadata table for the
> corpus" page, but it tells me that "No XML annotations found for this
> corpus". Is there something wrong with how I did the encoding above? I can
> use all of these XML elements in cqp searches directly, but here they
> aren't recognized.
>
>
>
> (I've checked chapter 6 of the manual, to no avail).
>
>
>
> Best wishes,
>
> Scott
>
>
>