[CWB] Can't create metadata

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Nov 14 15:08:14 CET 2016


Daniel’s sample data file illustrates one of the two methods for supplying more-than-minimal text metadata. The metadata can either be loaded from a tab-delimited file or deduced from the XML in the corpus itself; Daniel’s sample uses the latter method.
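
To make the relationship between the two methods concrete, here is a rough Python sketch that reads the attributes on <text> start tags (as in Daniel’s sample) and lays them out as tab-delimited rows. The function names are hypothetical, and the column layout (text ID first, then one column per attribute, no header line) is an assumption; this thread does not specify the exact file format CQPweb expects.

```python
import re

# Matches name="value" pairs inside a start tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def text_metadata_rows(lines):
    """Collect the attributes of every <text ...> start tag.

    Returns (field_names, rows); "id" is always the first field.
    """
    records = []
    fields = []  # attribute names other than id, in order of first appearance
    for line in lines:
        line = line.strip()
        if not line.startswith("<text ") or not line.endswith(">"):
            continue  # token lines, other tags, closing tags
        attrs = dict(ATTR_RE.findall(line))
        records.append(attrs)
        for name in attrs:
            if name != "id" and name not in fields:
                fields.append(name)
    field_names = ["id"] + fields
    # Texts that lack an attribute get an empty cell in that column.
    rows = [[rec.get(name, "") for name in field_names] for rec in records]
    return field_names, rows


def to_tab_delimited(rows):
    """Join rows into tab-delimited text, one text per line (assumed layout)."""
    return "\n".join("\t".join(row) for row in rows)
```

Calling text_metadata_rows() on the lines of a vertical file and passing the rows to to_tab_delimited() gives one line per text, which can then be compared against whatever the metadata form actually asks for.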

For minimal metadata you only need <text> elements with an id attribute, whose values must be handles, i.e. just letters and numbers, with no spaces or punctuation.
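
A quick way to check a vertical file against this requirement before indexing might look like the following Python sketch. It assumes that "handle" means ASCII letters and digits only, as described above, and that each <text> start tag sits on its own line; check_text_ids is a hypothetical helper, not part of CWB or CQPweb.

```python
import re
from collections import Counter

TEXT_TAG_RE = re.compile(r'^<text\b([^>]*)>')   # a <text ...> start tag
ID_RE = re.compile(r'\bid="([^"]*)"')           # its id="..." attribute
HANDLE_RE = re.compile(r'^[A-Za-z0-9]+$')       # assumed definition of a handle


def check_text_ids(lines):
    """Return a list of human-readable problems with <text> ids."""
    ids = []
    problems = []
    for number, line in enumerate(lines, start=1):
        tag = TEXT_TAG_RE.match(line.strip())
        if tag is None:
            continue  # not a <text> start tag
        found = ID_RE.search(tag.group(1))
        text_id = found.group(1) if found else ""
        if not HANDLE_RE.match(text_id):
            problems.append(f"line {number}: missing or invalid id {text_id!r}")
        ids.append(text_id)
    # Every text needs its own id, so repeated values are reported too.
    for text_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"id {text_id!r} is used {count} times")
    return problems
```

An empty result means every <text> tag found carries a distinct, well-formed id; it does not tell you whether any <text> tags exist at all.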

It is a rule of CQPweb corpora that the whole corpus needs to occur within <text> elements, each of which must have an id, and there can’t be any words that are not inside a <text> element. If you don’t care about text boundaries, you can just wrap the whole corpus in one <text id="CORPUS"> … </text>
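
For a file that has no <text> tags at all, the single-wrapper approach can be sketched in Python as below. This assumes the input is a plain vertical file with no existing <text> elements; wrap_in_single_text is a hypothetical name, and the default id "CORPUS" is simply the one used in the example above.

```python
def wrap_in_single_text(lines, text_id="CORPUS"):
    """Yield the lines of a vertical file enclosed in one <text> element.

    Assumes the input contains no <text> tags of its own.
    """
    yield f'<text id="{text_id}">'
    for line in lines:
        yield line.rstrip("\n")  # keep tabs, drop only the line ending
    yield "</text>"
```

The generator can be fed an open file object and its output written line by line to a new file, so even a large corpus file is never held in memory all at once.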

This is explained in my paper:


  *   Hardie, Andrew (2012). CQPweb – combining power, flexibility and usability in a corpus analysis tool<http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004>. International Journal of Corpus Linguistics 17 (3): 380-409. [alternative link]<http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf>

Sorry it’s not written up in the manual yet, only so many hours in a day alas…

best

Andrew.


From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Daniel Renau
Sent: 14 November 2016 13:46
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Can't create metadata


Hi jiayue,

My team works with verticalized texts like this:

<text id="ST1" title="namewithoutspaces" author="name">
<s>
word pos lemma
word pos lemma
word pos lemma
word pos lemma
</s>
</text>

<text id="ST2" title="anothertextname" author="otherperson">
<s>
word pos lemma
word pos lemma
word pos lemma
</s>
</text>

You can add more text tags as: author_sex, language, year, translator...

On 14 Nov 2016 at 2:37 p.m., "Jiayue Wang" <arthur0421 at gmail.com> wrote:
Thanks Andrew. I still don't understand where the tags <text id=""> and </text> should be added. Should they enclose a corpus file? I notice that section 7.6 "Metadata template" of the CQPwebAdminManual is empty. Could you show me a template?

Best,
Jiayue

On 14/11/16 09:38, Hardie, Andrew wrote:
Well it looks rather as if you don't have any text tags at all there... which would be part of the problem. Try again with <text id="...">...</text> tags added to the file, as required.

As for why indexing is taking so long, it's very difficult for me to diagnose at a distance. You should keep an eye on your process list (e.g. via top) to see if anything is actually happening. As long as a cwb-*** process is running, something productive is happening, and you shouldn't abort.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
Sent: 13 November 2016 11:06
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Can't create metadata

Hi Andrew,

Thanks a lot.
I deleted the us_rhodeisland corpus and tried again to install it. The
corpus file looks like this:

If      IN      if
you     PP      you
have    VBP     have
any     DT      any
questions       NNS     question
or      CC      or
suggestions     NNS     suggestion
how     WRB     how
this    DT      this
website NN      website
might   MD      might
be      VB      be
improved        VBN     improve
,       ,       ,
please  VB      please
feel    VB      feel
free    JJ      free
to      TO      to
contact VB      contact
us      PP      us
.       SENT    .

The corpus contains only this file (44.0 MB). For the P-attributes I
selected the "POS and lemma (TreeTagger format)" option. Then I clicked
Install; 31 files were created in the index/us_rhodeisland folder, but
the process goes on endlessly. I interrupted the process and tried again,
but the same thing happened. Roughly how long should this take on my
laptop, which has 8 GB of RAM and an Intel i5 quad-core CPU?

Best
Jiayue

On 13/11/16 06:19, Hardie, Andrew wrote:
This error message suggests that your <text> elements lack valid ID
codes.

The most likely reason for [UNREADABLE] is that you have declared a
primary annotation, e.g. a part of speech tag, but the annotation in
question does not exist. This can happen if you use a template that
your data does not match, for instance.
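
One way to look for that kind of mismatch is to count the columns on each token line and compare the count with what the template declares. The Python sketch below assumes tab-separated columns and three columns per token (word, POS, lemma); the function name is hypothetical and the tag test is deliberately simple.

```python
import re

# A line that is an XML tag rather than a token: <s>, </s>, <text id="...">
TAG_RE = re.compile(r'^</?\w[^\t]*>$')


def lines_with_wrong_column_count(lines, expected_columns=3):
    """Return (line_number, column_count) for token lines that do not
    have the expected number of tab-separated columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or TAG_RE.match(line):
            continue  # skip blank lines and XML tags
        columns = line.split("\t")
        if len(columns) != expected_columns:
            bad.append((number, len(columns)))
    return bad
```

If every token line is reported with a count of 1, the file is probably raw text rather than tagged text, or its columns are separated by spaces rather than tabs.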

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
Sent: 11 November 2016 20:17
To: Open source development of the Corpus WorkBench
Subject: [CWB] Can't create metadata

Hi,

After a full installation of CQPweb I installed a corpus called
"us_rhodeisland" (including 2 files, a raw text, and a TreeTagger
tagged text) without metadata. Since I have no idea what a metadata
file looks like, I selected "No thanks, I'll run this myself (safer
for very large corpora)" and clicked "Create minimalist metadata
table" and saw the following error message:


A MySQL query did not run successfully!


Original query: insert into
___temp_cqp_text_positions_for_us_rhodeisland (text_id, cqp_begin,
cqp_end) VALUES ('', 0, 55858),('', 55859, 3058358) /* from User:
admin | Function: do_append_mysql_comment() | 2016-Nov-11 20:04:20
*/


Error # 1062: Duplicate entry '' for key 'PRIMARY'


BTW, when I try a standard query, each concordance line begins with
"[UNREADABLE] [UNREADABLE] [UNREADABLE]". What is the most likely
reason?

Any help is appreciated, thanks!

Jiayue Wang
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb