[CWB] WebInABox: Can't import existing corpora from host

Scott Sadowsky ssadowsky at gmail.com
Tue Jul 26 18:12:05 CEST 2016


On Tue, Jul 26, 2016 at 7:25 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

Hi Andrew,

> I have had a dig, and found the bug (it was a regex glitch parsing the
> inserted registry file). Update the code to rev 880 and you should find
> that the system will obediently detect your s-attributes. (You will still,
> naturally, need to go through the first step that I mentioned, of making
> sure all data from earlier passes is properly scrubbed.)
>
Eureka - with this new rev CQPweb now imports my XML metadata! Thanks so
much for hunting this down and fixing it!

I've now done the following:

1. I went through the "Manage Corpus XML" page and set descriptions and
data types. I set the attributes I want to be able to search on in queries,
subqueries, sub-corpora, etc. (e.g. speaker sex and location) to
"classification".

2. I went through the "Manage Annotation" page and linked the "Annotation
setup for CEQL queries" fields to the various annotation data in my corpus.

3. On the "Manage frequency lists" page I (re)generated everything (I've
attached the metadata table from mysql below).

I can now perform queries, and my metadata is recognized. But how do I
restrict searches using the s-attributes (say, speaker sex)? When I do a
query and then select "Distribution", for example, I'm told that "This
corpus has no text-classification metadata, so the distribution cannot be
shown".

Thanks!
Scott


mysql> select * from xml_metadata;
+----+--------------+-----------------+------------+-----------------------------------+----------+
| id | corpus       | handle          | att_family | description                       | datatype |
+----+--------------+-----------------+------------+-----------------------------------+----------+
|  1 | bncsampler   | s               | s          | s                                 |        0 |
|  2 | bncsampler   | text            | text       | text                              |        0 |
|  3 | bncsampler   | text_id         | text       | text_id                           |        3 |
|  4 | lcmc         | s               | s          | s                                 |        0 |
|  5 | lcmc         | text            | text       | text                              |        0 |
|  6 | lcmc         | text_id         | text       | text_id                           |        3 |
|  7 | test_coscach | s               | s          | Sentence                          |        0 |
|  8 | test_coscach | text            | text       | Text                              |        0 |
|  9 | test_coscach | text_id         | text       | Unique Text ID                    |        3 |
| 10 | test_coscach | text_corpus     | text       | Corpus name                       |        2 |
| 11 | test_coscach | text_tagger     | text       | Corpus tagger                     |        2 |
| 12 | test_coscach | text_language   | text       | Text language                     |        1 |
| 13 | test_coscach | text_channel    | text       | Spoken or written?                |        2 |
| 14 | test_coscach | text_instrument | text       | Elicitation instrument            |        1 |
| 15 | test_coscach | text_lingualism | text       | Speaker monolingual or bilingual? |        1 |
| 16 | test_coscach | text_location   | text       | Speaker location                  |        1 |
| 17 | test_coscach | text_sex        | text       | Speaker sex                       |        1 |
| 18 | test_coscach | text_generation | text       | Speaker generation                |        1 |
| 19 | test_coscach | text_sel        | text       | Speaker SEL                       |        1 |
+----+--------------+-----------------+------------+-----------------------------------+----------+
19 rows in set (0.00 sec)

mysql>


> *From:* Hardie, Andrew
> *Sent:* 25 July 2016 23:48
> *To:* Open source development of the Corpus WorkBench
> *Subject:* RE: [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> OK, 2 things:
>
>
>
> First – the result of the MySQL query shows that none of the XML of your
> corpus has been detected.
>
>
>
> Second – the other error you report is clearly referring to your earlier
> index data. The check on text ID validity is done at point of extraction
> *from* the index *to* CQPweb’s internal data structures. So, it is
> reading the index and getting bad values. This implies that your earlier
> index files still exist and are being read by CQPweb.
>
>
>
> So, the overall picture would seem to be that you have data hanging around
> from previous incarnations of the corpus, and your reinstallation did not
> work properly. Your best bet might be to make doubly sure everything is
> wiped from that corpus, then start over again. This will probably not fix
> all the problems but it *should* make the issues that remain clearer.
>
>
>
> best
>
>
>
> Andrew.
>
>
>
> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 25 July 2016 17:15
> *To:* Open source development of the Corpus WorkBench
> *Cc:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Mon, Jul 25, 2016 at 5:48 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> Try running
>
>
>
>           select * from xml_metadata;
>
>
>
> in the MySQL command line client, and see what you get.
>
>
>
> This is what I get:
>
>
>
> $ mysql -u root -p cqpweb
> Enter password:
> Reading table information for completion of table and column names
> [...]
> mysql> select * from xml_metadata;
> +----+------------+---------+------------+-------------+----------+
> | id | corpus     | handle  | att_family | description | datatype |
> +----+------------+---------+------------+-------------+----------+
> |  1 | bncsampler | s       | s          | s           |        0 |
> |  2 | bncsampler | text    | text       | text        |        0 |
> |  3 | bncsampler | text_id | text       | text_id     |        3 |
> |  4 | lcmc       | s       | s          | s           |        0 |
> |  5 | lcmc       | text    | text       | text        |        0 |
> |  6 | lcmc       | text_id | text       | text_id     |        3 |
> +----+------------+---------+------------+-------------+----------+
> 6 rows in set (0.00 sec)
>
> mysql>
>
>
>
>
>
> I have noted something anomalous on another front which may be relevant.
> When I go to the "Manage Metadata" page of the corpus I'm trying to get set
> up, and hit the "Create minimalist metadata table" button, I get an error
> which has nothing to do with my current corpus:
>
>
>
> The data source you specified for the text metadata contains
> badly-formatted text ID codes, as follows: <strong> '<no annotation>';
> 'CCN-F2-01_Ca_St.ortografica.txt'; 'CCN-F2-02_D_StB.ortografica.txt';
> 'CCN-F2-03_Ca_St.ortografica.txt';
> 'CCN-F2-04_Cb_St.ortografica.txt';[...]</strong> (text ids can only contain
> unaccented letters, numbers, and underscore).
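
The rule quoted in that error message (unaccented letters, numbers, and underscore only) is easy to check before indexing. Below is a minimal Python sketch of such a check. It is not CQPweb's own validation code; the function names `is_handle` and `to_handle` are hypothetical, and the clean-up step simply assumes that replacing each disallowed character with an underscore is acceptable for your IDs.

```python
import re

# Rule as quoted in the error message: unaccented (ASCII) letters,
# digits and underscore. \w is avoided because it also matches accented letters.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_handle(text_id: str) -> bool:
    """Return True if text_id uses only ASCII letters, digits and underscore."""
    return HANDLE_RE.fullmatch(text_id) is not None


def to_handle(text_id: str) -> str:
    """Hypothetical clean-up: replace every disallowed character with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)


if __name__ == "__main__":
    for text_id in ["CCN-F2-01_Ca_St.ortografica.txt", "CCN_F2_27_B"]:
        print(text_id, is_handle(text_id), to_handle(text_id))
```

A blanket replacement like this can map two different IDs onto the same handle, so the converted IDs should be checked for duplicates before the corpus is re-encoded.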
>
>
>
> None of these values are present in my current corpus, though they *were*
> in an earlier version. However, I removed them from the tagged texts after
> you explained that these values had to be handles. Here's what my metadata
> currently looks like:
>
>
>
> <text id="CCN_F2_27_B" corpus="coscach" tagger="freeling_xml"
> language="spanish" channel="oral" instrument="interview"
> lingualism="monolingual" location="concepcion" sex="f" generation="G2"
> sel="B">
>
>
>
> So values like 'CCN-F2-01_Ca_St.ortografica.txt' are not in my corpus any
> more (and I recompiled it from these files, of course), but they seem to be
> cached somewhere by CQPweb, and they are not getting updated by newer
> corpora I try to import. (Note that I've used different names, e.g.
> test_corpus, test_corpus_two, in order to try to get around this, but it
> hasn't worked).
>
>
>
> Cheers,
> Scott
>
>
>
>
>
>
>
> best
>
>
>
> Andrew.
>
>
>
>
>
>
>
> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 17:17
> *To:* Open source development of the Corpus WorkBench
> *Cc:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sun, Jul 24, 2016 at 11:29 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> First point – your text ID codes won’t work, they need to be *handles*,
> i.e. just ASCII letters, numbers, and underscore – no hyphens/full stops.
>
>
>
> Now corrected!
>
>
>
> Second point – the various s-attributes text_corpus, text_tagger etc.
> need (a) to exist in the registry – did your correction fix this? (b)
> CQPweb needs to have logged their existence – if it’s saying “No XML
> annotations found” that suggests it hasn’t, which could be a consequence of
> (a), or could be a bug.
>
>
>
> Unless I'm mistaken about what attributes are what, they are indeed in the
> registry. I've pasted it at the end of this e-mail, along with a single
> tagged source text sentence.
>
>
>
> There was in fact a bug with s-attributes in the registry failing to be
> detected which I fixed a few months back: I cannot recall if that was
> before or after the version of the code in the VM image. If you want to
> rule this out, connect the VM’s networking, upgrade CQPweb to the latest
> version from SVN (don’t forget to do the database upgrade!), and try again:
> if that fixes it, it was the old bug.
>
>
>
> I've been using revision 879 (3.2.20) the whole time, so it shouldn't be
> the old bug.
>
>
>
>
>
> Once CQPweb is aware of your XML attributes you should be able to use them
> to derive text metadata.
>
>
>
> Thanks for your patience!
>
>
>
> Cheers,
>
> Scott
>
>
>
>
>
> <text id="CCN_F2_25_Ca" corpus="test_two" tagger="freeling_xml"
> language="spanish" channel="oral" instrument="interview"
> lingualism="monolingual" location="concepcion" sex="f" generation="G2"
> sel="Ca">
>
> <s>
>
> ¿       ¿       Fia     Fia     punctuation     questionmark
>
> todavía todavía RG      RG      adverb  general
>
> está    estar   VAIP3S0 VAI     verb    auxiliary
>
> grabando        grabar  VMG0000 VMG     verb    main
>
> ?       ?       Fit     Fit     punctuation     questionmark
>
> </s>
>
> </text>
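
Judging by this sample and the registry entry that follows it, each attribute on the <text> start tag corresponds to an s-attribute named text_<attribute> (for example sex becomes text_sex). The Python sketch below only illustrates that naming pattern so the expected names can be listed from a source file; `expected_s_attributes` is a hypothetical helper, and it assumes attribute values are double-quoted as in the sample.

```python
import re

# name="value" pairs with double-quoted values, as in the sample text above.
ATTR_RE = re.compile(r'([A-Za-z_][\w.-]*)\s*=\s*"([^"]*)"')
# Element name at the start of an XML start tag.
TAG_RE = re.compile(r"\s*<([A-Za-z_][\w.-]*)")


def expected_s_attributes(start_tag: str) -> list[str]:
    """Map '<text a=".." b="..">' to ['text', 'text_a', 'text_b']."""
    tag_match = TAG_RE.match(start_tag)
    if tag_match is None:
        raise ValueError("not an XML start tag: %r" % start_tag)
    element = tag_match.group(1)
    names = [element]
    for attr_name, _value in ATTR_RE.findall(start_tag):
        names.append(f"{element}_{attr_name}")
    return names


if __name__ == "__main__":
    tag = '<text id="CCN_F2_25_Ca" corpus="test_two" sex="f">'
    print(expected_s_attributes(tag))
```

Comparing such a list with the STRUCTURE lines in the registry is one way to spot an attribute that is declared but never appears in the data, or the reverse.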
>
>
>
>
>
>
>
> ##
>
> ## registry entry for corpus TEST_TWO
>
> ##
>
>
>
> # long descriptive name for the corpus
>
> NAME ""
>
> # corpus ID (must be lowercase in registry!)
>
> ID   test_two
>
> # path to binary data files
>
> HOME /var/cqpweb/index/test_two
>
> # optional info file (displayed by "info;" command in CQP)
>
> INFO /var/cqpweb/index/test_two/.info
>
>
>
> # corpus properties provide additional information about the corpus:
>
> ##:: charset  = "utf8" # character encoding of corpus data
>
> ##:: language = "es"     # insert ISO code for language (de, en, fr, ...)
>
>
>
>
>
> ##
>
> ## p-attributes (token annotations)
>
> ##
>
>
>
> ATTRIBUTE word
>
> ATTRIBUTE lemma
>
> ATTRIBUTE tag
>
> ATTRIBUTE ctag
>
> ATTRIBUTE pos
>
> ATTRIBUTE type
>
>
>
>
>
> ##
>
> ## s-attributes (structural markup)
>
> ##
>
>
>
> # <s> ... </s>
>
> # (no recursive embedding allowed)
>
> STRUCTURE s
>
>
>
> # <text id=".." corpus=".." tagger=".." file=".." language=".."
> channel=".." instrument=".." lingualism=".." location=".." sex=".."
> generation=".." sel=".."> ... </text>
>
> # (no recursive embedding allowed)
>
> STRUCTURE text
>
> STRUCTURE text_id              # [annotations]
>
> STRUCTURE text_corpus          # [annotations]
>
> STRUCTURE text_tagger          # [annotations]
>
> STRUCTURE text_file            # [annotations]
>
> STRUCTURE text_language        # [annotations]
>
> STRUCTURE text_channel         # [annotations]
>
> STRUCTURE text_instrument      # [annotations]
>
> STRUCTURE text_lingualism      # [annotations]
>
> STRUCTURE text_location        # [annotations]
>
> STRUCTURE text_sex             # [annotations]
>
> STRUCTURE text_generation      # [annotations]
>
> STRUCTURE text_sel             # [annotations]
>
>
>
>
>
> # Yours sincerely, the Encode tool.
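
To see which s-attributes a registry file like the one above declares, independently of CQPweb's own detection, one can scan it for STRUCTURE lines. This is only a sketch under the assumption that every declaration sits on its own line in the form shown above; `s_attributes` is a hypothetical helper, not the parser used inside CQPweb.

```python
import re

# Matches declarations such as "STRUCTURE text_sex   # [annotations]".
# Comment lines start with '#', so they never match.
STRUCTURE_RE = re.compile(r"^\s*STRUCTURE\s+(\S+)")


def s_attributes(registry_text: str) -> list[str]:
    """Return the declared s-attribute names in file order."""
    names = []
    for line in registry_text.splitlines():
        match = STRUCTURE_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = """\
# <s> ... </s>
STRUCTURE s
STRUCTURE text
STRUCTURE text_id              # [annotations]
STRUCTURE text_sex             # [annotations]
"""
    print(s_attributes(sample))
```

In practice the input would be the contents of the corpus's registry file, read with `open(path, encoding="utf-8").read()`; the path depends on the local installation.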
>
>
>
>
>
>
>
> *From:* cwb-bounces at liste.sslmit.unibo.it [mailto:
> cwb-bounces at liste.sslmit.unibo.it] *On Behalf Of *Scott Sadowsky
> *Sent:* 24 July 2016 15:52
> *To:* CWBdev Mailing List
>
>
> *Subject:* [CWB] WebInABox: Can't import existing corpora from host
>
>
>
> On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> CQPweb requires all corpora to have at least one <text> element, and every
> text element has to have an id, i.e. everything within the corpus has to be
> contained within a sequence of one or more
>
>
>
> <text id="somethinghere">
>
>>
> </text>
>
>
>
> Thanks, Andrew. It turns out the problem was that I had been using the
> name "id" instead of "text" for the element. Now that I've changed that, I
> was able to successfully create the corpus in CQPweb.
>
>
>
> My source files have quite a bit of metadata, which I've encoded as
> follows:
>
>
>
> <text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml"
> language="spanish" location="concepcion" sex="f">
>
> ...
>
> </text>
>
>
> I'm now at the CQPweb "Design and insert a text-metadata table for the
> corpus" page, but it tells me that "No XML annotations found for this
> corpus". Is there something wrong with how I did the encoding above? I can
> use all of these XML elements in cqp searches directly, but here they
> aren't recognized.
>
>
>
> (I've checked chapter 6 of the manual, to no avail).
>
>
>