[CWB] WebInABox: Can't import existing corpora from host

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Jul 26 18:18:38 CEST 2016


>>> But how do I restrict searches using the s-attributes (say, speaker sex)? When I do a query and then select "Distribution", for example, I'm told that "This corpus has no text-classification metadata, so the distribution cannot be shown".


·         Go to Restricted query

·         You should see options to restrict your query to XML segments where the given attribute has a particular category handle for any s-att that you set to datatype “Classification”

·         OR, go to “Create / edit subcorpora” and define subcorpora using the same control, then use those SCs as restriction criteria.

Note that non-text-based corpus restrictions and subcorpora aren’t currently supported in the Distribution display. I know this is a pain, and it’s high on my feature list (but it’s quite a big job, so it can’t be done quickly!).

best

Andrew.

From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 26 July 2016 17:12
To: Open source development of the Corpus WorkBench
Cc: Open source development of the Corpus WorkBench
Subject: Re: [CWB] WebInABox: Can't import existing corpora from host

On Tue, Jul 26, 2016 at 7:25 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

Hi Andrew,

I have had a dig, and found the bug (it was a regex glitch parsing the inserted registry file). Update the code to rev 880 and you should find that the system will obediently detect your s-attributes. (You will still, naturally, need to go through the first step that I mentioned, of making sure all data from earlier passes is properly scrubbed.)
Eureka - with this new rev CQPweb now imports my XML metadata! Thanks so much for hunting this down and fixing it!

I've now done the following:

1. I went through the "Manage Corpus XML" page and set descriptions and data types, defining the attributes I want to be able to search on in queries, subqueries, sub-corpora, etc. to "classification" (e.g. speaker sex and location).

2. I went through the "Manage Annotation" page and linked the "Annotation setup for CEQL queries" fields to the various annotation data in my corpus.

3. On the "Manage frequency lists" page I (re)generated everything (I've attached the metadata table from mysql below).

I can now perform queries, and my metadata is recognized. But how do I restrict searches using the s-attributes (say, speaker sex)? When I do a query and then select "Distribution", for example, I'm told that "This corpus has no text-classification metadata, so the distribution cannot be shown".

Thanks!
Scott


mysql> select * from xml_metadata;
+----+--------------+-----------------+------------+-----------------------------------+----------+
| id | corpus       | handle          | att_family | description                       | datatype |
+----+--------------+-----------------+------------+-----------------------------------+----------+
|  1 | bncsampler   | s               | s          | s                                 |        0 |
|  2 | bncsampler   | text            | text       | text                              |        0 |
|  3 | bncsampler   | text_id         | text       | text_id                           |        3 |
|  4 | lcmc         | s               | s          | s                                 |        0 |
|  5 | lcmc         | text            | text       | text                              |        0 |
|  6 | lcmc         | text_id         | text       | text_id                           |        3 |
|  7 | test_coscach | s               | s          | Sentence                          |        0 |
|  8 | test_coscach | text            | text       | Text                              |        0 |
|  9 | test_coscach | text_id         | text       | Unique Text ID                    |        3 |
| 10 | test_coscach | text_corpus     | text       | Corpus name                       |        2 |
| 11 | test_coscach | text_tagger     | text       | Corpus tagger                     |        2 |
| 12 | test_coscach | text_language   | text       | Text language                     |        1 |
| 13 | test_coscach | text_channel    | text       | Spoken or written?                |        2 |
| 14 | test_coscach | text_instrument | text       | Elicitation instrument            |        1 |
| 15 | test_coscach | text_lingualism | text       | Speaker monolingual or bilingual? |        1 |
| 16 | test_coscach | text_location   | text       | Speaker location                  |        1 |
| 17 | test_coscach | text_sex        | text       | Speaker sex                       |        1 |
| 18 | test_coscach | text_generation | text       | Speaker generation                |        1 |
| 19 | test_coscach | text_sel        | text       | Speaker SEL                       |        1 |
+----+--------------+-----------------+------------+-----------------------------------+----------+
19 rows in set (0.00 sec)

mysql>




From: Hardie, Andrew
Sent: 25 July 2016 23:48
To: Open source development of the Corpus WorkBench
Subject: RE: [CWB] WebInABox: Can't import existing corpora from host



OK, 2 things:



First – the result of the MySQL query shows that none of the XML of your corpus has been detected.



Second – the other error you report is clearly referring to your earlier index data. The check on text ID validity is done at point of extraction from the index to CQPweb’s internal data structures. So, it is reading the index and getting bad values. This implies that your earlier index files still exist and are being read by CQPweb.



So, the overall picture would seem to be that you have data hanging around from previous incarnations of the corpus, and your reinstallation did not work properly. Your best bet might be to make doubly sure everything is wiped from that corpus, then start over again. This will probably not fix all the problems but it should make the issues that remain clearer.



best



Andrew.



From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 25 July 2016 17:15
To: Open source development of the Corpus WorkBench
Cc: Open source development of the Corpus WorkBench
Subject: Re: [CWB] WebInABox: Can't import existing corpora from host



On Mon, Jul 25, 2016 at 5:48 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:



Try running



          select * from xml_metadata;



in the MySQL command line client, and see what you get.



This is what I get:



$ mysql -u root -p cqpweb
Enter password:
Reading table information for completion of table and column names
[...]
mysql> select * from xml_metadata;
+----+------------+---------+------------+-------------+----------+
| id | corpus     | handle  | att_family | description | datatype |
+----+------------+---------+------------+-------------+----------+
|  1 | bncsampler | s       | s          | s           |        0 |
|  2 | bncsampler | text    | text       | text        |        0 |
|  3 | bncsampler | text_id | text       | text_id     |        3 |
|  4 | lcmc       | s       | s          | s           |        0 |
|  5 | lcmc       | text    | text       | text        |        0 |
|  6 | lcmc       | text_id | text       | text_id     |        3 |
+----+------------+---------+------------+-------------+----------+
6 rows in set (0.00 sec)

mysql>





I have noted something anomalous on another front which may be relevant. When I go to the "Manage Metadata" page of the corpus I'm trying to get set up, and hit the "Create minimalist metadata table" button, I get an error which has nothing to do with my current corpus:



The data source you specified for the text metadata contains badly-formatted text ID codes, as follows: <strong> '<no annotation>'; 'CCN-F2-01_Ca_St.ortografica.txt'; 'CCN-F2-02_D_StB.ortografica.txt'; 'CCN-F2-03_Ca_St.ortografica.txt'; 'CCN-F2-04_Cb_St.ortografica.txt';[...]</strong> (text ids can only contain unaccented letters, numbers, and underscore).



None of these values are present in my current corpus, though they were in an earlier version. However, I removed them from the tagged texts after you explained that these values had to be handles. Here's what my metadata currently looks like:



<text id="CCN_F2_27_B" corpus="coscach" tagger="freeling_xml" language="spanish" channel="oral" instrument="interview" lingualism="monolingual" location="concepcion" sex="f" generation="G2" sel="B">



So values like 'CCN-F2-01_Ca_St.ortografica.txt' are not in my corpus any more (and I recompiled it from these files, of course), but they seem to be cached somewhere by CQPweb, and they are not getting updated by newer corpora I try to import. (Note that I've used different names, e.g. test_corpus, test_corpus_two, in order to try to get around this, but it hasn't worked).



Cheers,
Scott







best



Andrew.







From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 July 2016 17:17
To: Open source development of the Corpus WorkBench
Cc: Open source development of the Corpus WorkBench
Subject: Re: [CWB] WebInABox: Can't import existing corpora from host



On Sun, Jul 24, 2016 at 11:29 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:



First point – your text ID codes won’t work, they need to be handles, i.e. just ASCII letters, numbers, and underscore – no hyphens/full stops.



Now corrected!
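
The handle rule quoted above (ASCII letters, numbers, and underscore only) is easy to check before encoding. The following is a minimal Python sketch, not part of CQPweb; the function names `is_handle` and `to_handle` are made up for illustration, and replacing every disallowed character with an underscore is just one possible repair strategy.

```python
import re

# A handle, per the rule quoted above: ASCII letters, digits, underscore.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_handle(text_id: str) -> bool:
    """Return True if text_id consists only of handle characters."""
    return HANDLE_RE.fullmatch(text_id) is not None


def to_handle(text_id: str) -> str:
    """Replace every disallowed character with an underscore (hypothetical repair)."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text_id)


if __name__ == "__main__":
    for candidate in ("CCN_F2_27_B", "CCN-F2-02_D_StB.ortografica.txt"):
        print(candidate, is_handle(candidate), to_handle(candidate))
```

Note that a repair like this can map two different IDs onto the same handle (for example `a-b` and `a.b`), so the repaired IDs would still need to be checked for uniqueness.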



Second point – the various s-attributes text_corpus , text_tagger etc. need (a) to exist in the registry – did your correction fix this? (b) CQPweb needs to have logged their existence – if it’s saying “No XML annotations found” that suggests it hasn’t, which could be a consequence of (a), or could be a bug.



Unless I'm mistaken about what attributes are what, they are indeed in the registry. I've pasted it at the end of this e-mail, along with a single tagged source text sentence.



There was in fact a bug with s-attributes in the registry failing to be detected which I fixed a few months back: I cannot recall if that was before or after the version of the code in the VM image. If you want to rule this out, connect the VM’s networking, upgrade CQPweb to the latest version from SVN (don’t forget to do the database upgrade!), and try again: if that fixes it, it was the old bug.



I've been using revision 879 (3.2.20) the whole time, so it shouldn't be the old bug.





Once CQPweb is aware of your XML attributes you should be able to use them to derive text metadata.



Thanks for your patience!



Cheers,

Scott





<text id="CCN_F2_25_Ca" corpus="test_two" tagger="freeling_xml" language="spanish" channel="oral" instrument="interview" lingualism="monolingual" location="concepcion" sex="f" generation="G2" sel="Ca">
<s>
¿       ¿       Fia     Fia     punctuation     questionmark
todavía todavía RG      RG      adverb  general
está    estar   VAIP3S0 VAI     verb    auxiliary
grabando        grabar  VMG0000 VMG     verb    main
?       ?       Fit     Fit     punctuation     questionmark
</s>
</text>







##
## registry entry for corpus TEST_TWO
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   test_two
# path to binary data files
HOME /var/cqpweb/index/test_two
# optional info file (displayed by "info;" command in CQP)
INFO /var/cqpweb/index/test_two/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "es"     # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
ATTRIBUTE ctag
ATTRIBUTE pos
ATTRIBUTE type


##
## s-attributes (structural markup)
##

# <s> ... </s>
# (no recursive embedding allowed)
STRUCTURE s

# <text id=".." corpus=".." tagger=".." file=".." language=".." channel=".." instrument=".." lingualism=".." location=".." sex=".." generation=".." sel=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id              # [annotations]
STRUCTURE text_corpus          # [annotations]
STRUCTURE text_tagger          # [annotations]
STRUCTURE text_file            # [annotations]
STRUCTURE text_language        # [annotations]
STRUCTURE text_channel         # [annotations]
STRUCTURE text_instrument      # [annotations]
STRUCTURE text_lingualism      # [annotations]
STRUCTURE text_location        # [annotations]
STRUCTURE text_sex             # [annotations]
STRUCTURE text_generation      # [annotations]
STRUCTURE text_sel             # [annotations]





# Yours sincerely, the Encode tool.
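
To double-check which s-attributes a registry entry like the one above actually declares, the STRUCTURE lines can be listed with a few lines of Python. This is only a rough sketch for inspecting the file by hand; it is not how CQPweb itself parses the registry, and the function name `list_s_attributes` is invented for this example.

```python
import re
from typing import List

# A declaration line looks like: STRUCTURE text_sex    # [annotations]
STRUCTURE_RE = re.compile(r"^\s*STRUCTURE\s+(\S+)")


def list_s_attributes(registry_text: str) -> List[str]:
    """Return the s-attribute names declared on STRUCTURE lines, in file order."""
    names = []
    for line in registry_text.splitlines():
        match = STRUCTURE_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    import sys

    # Usage: python list_s_attributes.py <path to registry file>
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name in list_s_attributes(handle.read()):
            print(name)
```

Comment lines (starting with `#`) and ATTRIBUTE lines are skipped, so the output should contain only the structural attributes such as `s`, `text` and `text_id`.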







From: cwb-bounces at liste.sslmit.unibo.it [mailto:cwb-bounces at liste.sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 24 July 2016 15:52
To: CWBdev Mailing List

Subject: [CWB] WebInABox: Can't import existing corpora from host



On Sun, Jul 24, 2016 at 10:19 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:



CQPweb requires all corpora to have at least one <text> element, and every text element has to have an id, i.e. everything within the corpus has to be contained within a sequence of one or more



<text id="somethinghere">

…

</text>
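
The requirement above (every `<text>` element carries an `id`) can be checked on the input files before indexing. Below is a minimal Python sketch under some assumptions: each start tag sits on a single line, attribute values are in straight double quotes, and the helper name `text_ids` is made up for this example. It is not part of CWB or CQPweb.

```python
import re
from typing import List, Optional

# Matches <text ...> start tags, but not </text> or tags such as <text_id>.
TEXT_OPEN_RE = re.compile(r"<text\b([^>]*)>")
# Matches an id="..." attribute preceded by whitespace.
ID_ATTR_RE = re.compile(r'(?:^|\s)id="([^"]*)"')


def text_ids(source: str) -> List[Optional[str]]:
    """Return the id of each <text> start tag, or None where the id is missing."""
    ids = []
    for tag in TEXT_OPEN_RE.finditer(source):
        id_match = ID_ATTR_RE.search(tag.group(1))
        ids.append(id_match.group(1) if id_match else None)
    return ids


if __name__ == "__main__":
    sample = '<text id="somethinghere">\n...\n</text>\n<text corpus="test">\n...\n</text>\n'
    print(text_ids(sample))
```

An empty result means the file has no `<text>` element at all, and any `None` entry marks a `<text>` without an `id`.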



Thanks, Andrew. It turns out the problem was that I had been using the name "id" instead of "text" for the element. Now that I've changed that, I was able to successfully create the corpus in CQPweb.



My source files have quite a bit of metadata, which I've encoded as follows:



<text id="CCN-F2-02_D_StB.ortografica.txt" corpus="test" tagger="freeling-xml" language="spanish" location="concepcion" sex="f">

...

</text>

I'm now at the CQPweb "Design and insert a text-metadata table for the corpus" page, but it tells me that "No XML annotations found for this corpus". Is there something wrong with how I did the encoding above? I can use all of these XML elements in cqp searches directly, but here they aren't recognized.



(I've checked chapter 6 of the manual, to no avail).

