[CWB] WebInABox: Can't import existing corpora from host

Scott Sadowsky ssadowsky at gmail.com
Mon Jul 25 18:14:44 CEST 2016


On Mon, Jul 25, 2016 at 5:48 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> Try running
>
>           select * from xml_metadata;
>
> in the MySQL command line client, and see what you get.
>

This is what I get:

$ mysql -u root -p cqpweb
Enter password:
Reading table information for completion of table and column names
[...]
mysql> select * from xml_metadata;
+----+------------+---------+------------+-------------+----------+
| id | corpus     | handle  | att_family | description | datatype |
+----+------------+---------+------------+-------------+----------+
|  1 | bncsampler | s       | s          | s           |        0 |
|  2 | bncsampler | text    | text       | text        |        0 |
|  3 | bncsampler | text_id | text       | text_id     |        3 |
|  4 | lcmc       | s       | s          | s           |        0 |
|  5 | lcmc       | text    | text       | text        |        0 |
|  6 | lcmc       | text_id | text       | text_id     |        3 |
+----+------------+---------+------------+-------------+----------+
6 rows in set (0.00 sec)

mysql>


I have noted something anomalous on another front which may be relevant.
When I go to the "Manage Metadata" page of the corpus I'm trying to get set
up, and hit the "Create minimalist metadata table" button, I get an error
which has nothing to do with my current corpus:

The data source you specified for the text metadata contains
badly-formatted text ID codes, as follows: <strong> '<no annotation>';
'CCN-F2-01_Ca_St.ortografica.txt'; 'CCN-F2-02_D_StB.ortografica.txt';
'CCN-F2-03_Ca_St.ortografica.txt';
'CCN-F2-04_Cb_St.ortografica.txt';[...]</strong> (text ids can only contain
unaccented letters, numbers, and underscore).
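
For reference, the rule stated in that error message can be sketched as a small check. This is only an illustration, not CQPweb's actual validation code: the pattern below assumes a handle is one or more ASCII letters, digits, or underscores, and `to_handle` is a hypothetical clean-up helper of my own.

```python
import re

# Assumption: a handle is one or more ASCII letters, digits, or underscores,
# as described in the error message. CQPweb's real check may differ.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_handle(text_id):
    """Return True if text_id contains only ASCII letters, digits, and underscores."""
    return HANDLE_RE.fullmatch(text_id) is not None


def to_handle(text_id):
    """Hypothetical clean-up: replace every disallowed character with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)


if __name__ == "__main__":
    # The first ID is from the current corpus; the other two are from the error message.
    for candidate in ["CCN_F2_27_B", "CCN-F2-01_Ca_St.ortografica.txt", "<no annotation>"]:
        print(candidate, is_handle(candidate), to_handle(candidate))
```

Under this reading, the hyphens and full stops in the old IDs are what trip the check, while IDs like CCN_F2_27_B pass.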

None of these values are present in my current corpus, though they *were*
in an earlier version. However, I removed them from the tagged texts after
you explained that these values had to be handles. Here's what my metadata
currently looks like:

<text id="CCN_F2_27_B" corpus="coscach" tagger="freeling_xml"
language="spanish" channel="oral" instrument="interview"
lingualism="monolingual" location="concepcion" sex="f" generation="G2"
sel="B">

So values like 'CCN-F2-01_Ca_St.ortografica.txt' are not in my corpus any
more (and I recompiled it from these files, of course), but they seem to be
cached somewhere by CQPweb, and they are not getting updated by newer
corpora I try to import. (Note that I've tried using different corpus names,
e.g. test_corpus and test_corpus_two, to get around this, but it hasn't
worked.)
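
One way to double-check that the source files themselves no longer contain the old IDs is to scan them for <text ...> start tags and list any id value that is not a plain handle. The sketch below is a rough illustration only: the regular expressions assume attributes are always written as name="value", the sample string is made up from the tags shown in this thread, and `text_attributes` and `suspicious_ids` are hypothetical helper names.

```python
import re

# Matches a <text ...> start tag and captures its attribute string.
# [^>]* also spans line breaks, so tags wrapped over several lines are found.
TEXT_TAG_RE = re.compile(r"<text\s+([^>]*)>")
# Matches name="value" pairs inside the tag (assumes double-quoted values).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# Assumption: a handle is ASCII letters, digits, and underscores only.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def text_attributes(vertical):
    """Return one dict of attributes per <text ...> start tag in the input."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in TEXT_TAG_RE.finditer(vertical)]


def suspicious_ids(vertical):
    """Return the id values that are missing (empty string) or not plain handles."""
    ids = [attrs.get("id", "") for attrs in text_attributes(vertical)]
    return [i for i in ids if HANDLE_RE.fullmatch(i) is None]


# Made-up sample: one tag in the current style, one with an old-style ID.
SAMPLE = '''<text id="CCN_F2_27_B" corpus="coscach" tagger="freeling_xml"
language="spanish" sex="f">
<s>
</s>
</text>
<text id="CCN-F2-01_Ca_St.ortografica.txt" corpus="test">
</text>
'''

if __name__ == "__main__":
    print(suspicious_ids(SAMPLE))
```

If a scan like this comes back empty for every file that was fed to the encoder, the old values are presumably coming from somewhere other than the source texts.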

Cheers,
Scott



> best
>
> Andrew.
>
> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 17:17
> *To:* Open source development of the Corpus WorkBench
> *Cc:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sun, Jul 24, 2016 at 11:29 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> First point – your text ID codes won’t work, they need to be *handles*,
> i.e. just ASCII letters, numbers, and underscore – no hyphens/full stops.
>
>
>
> Now corrected!
>
>
>
> Second point – the various s-attributes text_corpus, text_tagger etc.
> need (a) to exist in the registry – did your correction fix this? (b)
> CQPweb needs to have logged their existence – if it’s saying “No XML
> annotations found” that suggests it hasn’t, which could be a consequence of
> (a), or could be a bug.
>
>
>
> Unless I'm mistaken about what attributes are what, they are indeed in the
> registry. I've pasted it at the end of this e-mail, along with a single
> tagged source text sentence.
>
>
>
> There was in fact a bug with s-attributes in the registry failing to be
> detected which I fixed a few months back: I cannot recall if that was
> before or after the version of the code in the VM image. If you want to
> rule this out, connect the VM’s networking, upgrade CQPweb to the latest
> version from SVN (don’t forget to do the database upgrade!), and try again:
> if that fixes it, it was the old bug.
>
>
>
> I've been using revision 879 (3.2.20) the whole time, so it shouldn't be
> the old bug.
>
>
>
>
>
> Once CQPweb is aware of your XML attributes you should be able to use them
> to derive text metadata.
>
>
>
> Thanks for your patience!
>
>
>
> Cheers,
>
> Scott
>
>
>
>
>
> <text id="CCN_F2_25_Ca" corpus="test_two" tagger="freeling_xml"
> language="spanish" channel="oral" instrument="interview"
> lingualism="monolingual" location="concepcion" sex="f" generation="G2"
> sel="Ca">
>
> <s>
>
> ¿       ¿       Fia     Fia     punctuation     questionmark
>
> todavía todavía RG      RG      adverb  general
>
> está    estar   VAIP3S0 VAI     verb    auxiliary
>
> grabando        grabar  VMG0000 VMG     verb    main
>
> ?       ?       Fit     Fit     punctuation     questionmark
>
> </s>
>
> </text>
>
>
>
>
>
>
>
> ##
>
> ## registry entry for corpus TEST_TWO
>
> ##
>
>
>
> # long descriptive name for the corpus
>
> NAME ""
>
> # corpus ID (must be lowercase in registry!)
>
> ID   test_two
>
> # path to binary data files
>
> HOME /var/cqpweb/index/test_two
>
> # optional info file (displayed by "info;" command in CQP)
>
> INFO /var/cqpweb/index/test_two/.info
>
>
>
> # corpus properties provide additional information about the corpus:
>
> ##:: charset  = "utf8" # character encoding of corpus data
>
> ##:: language = "es"     # insert ISO code for language (de, en, fr, ...)
>
>
>
>
>
> ##
>
> ## p-attributes (token annotations)
>
> ##
>
>
>
> ATTRIBUTE word
>
> ATTRIBUTE lemma
>
> ATTRIBUTE tag
>
> ATTRIBUTE ctag
>
> ATTRIBUTE pos
>
> ATTRIBUTE type
>
>
>
>
>
> ##
>
> ## s-attributes (structural markup)
>
> ##
>
>
>
> # <s> ... </s>
>
> # (no recursive embedding allowed)
>
> STRUCTURE s
>
>
>
> # <text id=".." corpus=".." tagger=".." file=".." language=".."
> channel=".." instrument=".." lingualism=".." location=".." sex=".."
> generation=".." sel=".."> ... </text>
>
> # (no recursive embedding allowed)
>
> STRUCTURE text
>
> STRUCTURE text_id              # [annotations]
>
> STRUCTURE text_corpus          # [annotations]
>
> STRUCTURE text_tagger          # [annotations]
>
> STRUCTURE text_file            # [annotations]
>
> STRUCTURE text_language        # [annotations]
>
> STRUCTURE text_channel         # [annotations]
>
> STRUCTURE text_instrument      # [annotations]
>
> STRUCTURE text_lingualism      # [annotations]
>
> STRUCTURE text_location        # [annotations]
>
> STRUCTURE text_sex             # [annotations]
>
> STRUCTURE text_generation      # [annotations]
>
> STRUCTURE text_sel             # [annotations]
>
>
>
>
>
> # Yours sincerely, the Encode tool.
>
>
>
>
>
>
>
> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 15:52
> *To:* CWBdev Mailing List
>
>
> *Subject:* [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> CQPweb requires all corpora to have at least one <text> element, and every
> text element has to have an id i.e. everything within the corpus has to be
> contained within a sequence of one or more
>
>
>
> <text id="somethinghere">
>
> ...
> </text>
>
>
>
> Thanks, Andrew. It turns out the problem was that I had been using the
> name "id" instead of "text" for the element. Now that I've changed that, I
> was able to successfully create the corpus in CQPweb.
>
>
>
> My source files have quite a bit of metadata, which I've encoded as
> follows:
>
>
>
> <text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml"
> language="spanish" location="concepcion" sex="f">
>
> ...
>
> </text>
>
>
> I'm now at the CQPweb "Design and insert a text-metadata table for the
> corpus" page, but it tells me that "No XML annotations found for this
> corpus". Is there something wrong with how I did the encoding above? I can
> use all of these XML elements in cqp searches directly, but here they
> aren't recognized.
>
>
>
> (I've checked chapter 6 of the manual, to no avail).
>
>
>
> Best wishes,
>
> Scott
>
>
>
>


-- 
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/