[CWB] Can't create metadata

Hannah Kermes h.kermes at mx.uni-saarland.de
Mon Nov 14 15:20:37 CET 2016


Hi Jiayue,

the <text>-elements are used for the built-in distribution of CQPweb, so 
it makes sense to ask yourself what is most usefully enclosed in these 
elements.

Usually, you will enclose every text in your corpus in a separate 
<text>-element; these could be articles, essays or whole books, depending 
on what your corpus consists of. But we have also had corpora where we 
enclosed smaller units, e.g. chapters of a book or utterances, in 
<text>-elements to be able to use the built-in distribution.
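
A rough Python sketch of that wrapping step, in case it helps. It is not part of CQPweb: the function names are made up, and the rule that an ID may contain only ASCII letters and digits is my reading of Andrew's description of handles in the quoted message, so treat it as an assumption.

```python
import re


def make_handle(name):
    """Reduce a name to ASCII letters and digits (assumed handle rule)."""
    handle = re.sub(r"[^A-Za-z0-9]", "", name)
    if not handle:
        raise ValueError(f"cannot build an ID from {name!r}")
    return handle


def wrap_texts(documents):
    """Wrap each (name, vertical_text) pair in its own <text> element.

    Raises ValueError if two documents would end up with the same ID.
    """
    seen = set()
    parts = []
    for name, vertical in documents:
        text_id = make_handle(name)
        if text_id in seen:
            raise ValueError(f"duplicate text ID: {text_id}")
        seen.add(text_id)
        body = vertical.strip("\n")  # drop only surrounding blank lines
        parts.append(f'<text id="{text_id}">\n{body}\n</text>')
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    # Hypothetical documents in word / POS / lemma vertical format.
    docs = [
        ("essay 01", "If\tIN\tif\nyou\tPP\tyou\n"),
        ("essay 02", "please\tVB\tplease\n"),
    ]
    print(wrap_texts(docs), end="")
```

Whether one document is an article, a chapter or an utterance is decided by what you pass in as a pair; the script only guarantees that every token ends up inside a <text> element with its own ID.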

The metadata allow you to group the texts into different subcorpora (e.g. 
by author_sex, year, register, genre). Each column (in the tab-delimited 
file) or each attribute in the <text>-element stands for a different set 
of subcorpora (author_sex: male, female; register: academic, news, ...).
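
To illustrate how the two forms relate, here is a small Python sketch that reads the attributes of each <text> start tag and builds one tab-delimited row per text, with the text ID in the first column. The column layout, the sample values and the function names are assumptions for illustration only, so check them against what CQPweb expects before relying on them.

```python
import re

# Matches a <text ...> start tag that carries at least one attribute.
TEXT_TAG = re.compile(r"<text\s+([^>]*)>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def metadata_rows(lines, fields):
    """Return one row [id, field1, field2, ...] per <text> start tag."""
    rows = []
    for line in lines:
        match = TEXT_TAG.match(line.strip())
        if match is None:
            continue  # token line, <s>, </text>, ...
        attrs = dict(ATTRIBUTE.findall(match.group(1)))
        rows.append([attrs["id"]] + [attrs.get(f, "") for f in fields])
    return rows


def to_tsv(rows):
    """Serialise rows as tab-delimited lines."""
    return "".join("\t".join(row) + "\n" for row in rows)


def group_by(rows, fields, field):
    """Map each value of one field to the IDs of the texts that carry it."""
    column = fields.index(field) + 1  # column 0 holds the text ID
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[0])
    return groups


if __name__ == "__main__":
    fields = ["author_sex", "register"]
    # Hypothetical corpus lines; the attribute values are invented.
    sample = [
        '<text id="ST1" author_sex="female" register="academic">',
        "word\tpos\tlemma",
        "</text>",
        '<text id="ST2" author_sex="male" register="news">',
        "word\tpos\tlemma",
        "</text>",
    ]
    rows = metadata_rows(sample, fields)
    print(to_tsv(rows), end="")
    print(group_by(rows, fields, "author_sex"))
```

Each entry returned by group_by corresponds to one subcorpus in the sense described above: all texts that share a value in one column.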

Best

Hannah


On 14.11.2016 at 15:08, Hardie, Andrew wrote:
>
> Daniel’s sample of a datafile exemplifies one of the two methods for 
> more-than-minimal text metadata. This can either be loaded from a 
> tab-delimited file, or deduced from XML. The latter method is the one 
> Daniel exemplifies.
>
> For /minimal/ metadata you only require text with the ID attribute 
> (whose values must be /handles/, i.e. just letters, numbers with no 
> space / punctuation).
>
> It is a rule of CQPweb corpora that the whole corpus needs to occur 
> within <text> elements, each of which must have an id, and there can’t 
> be any words that are not inside a <text> element. If you don’t care 
> about text boundaries, you can just wrap the whole corpus in one <text 
> id="CORPUS"> … </text>
>
> This is explained in my paper:
>
>   * Hardie, Andrew (2012). CQPweb – combining power, flexibility and
>     usability in a corpus analysis tool
>     <http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004>.
>     /International Journal of Corpus Linguistics/ 17 (3): 380-409.
>     [alternative link]
>     <http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf>
>
> Sorry it’s not written up in the manual yet, only so many hours in a 
> day alas…
>
> best
>
> Andrew.
>
> From: cwb-bounces at sslmit.unibo.it 
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Daniel Renau
> Sent: 14 November 2016 13:46
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Can't create metadata
>
> Hi jiayue,
>
> My team works with verticalized texts like this:
>
> <text id="ST1" title="namewithoutspaces" author="name">
> <s>
> word pos lemma
> word pos lemma
> word pos lemma
> word pos lemma
> </s>
> </text>
>
> <text id="ST2" title="anothertextname" author="otherperson">
> <s>
> word pos lemma
> word pos lemma
> word pos lemma
> </s>
> </text>
>
> You can add more attributes to the text tags, such as author_sex, language, year, translator...
>
> On 14 Nov 2016 at 2:37 p.m., "Jiayue Wang" <arthur0421 at gmail.com 
> <mailto:arthur0421 at gmail.com>> wrote:
>
> Thanks Andrew. I still don't understand where the tags <text id=""> 
> and </text> should be added. Should they enclose a corpus file? I 
> notice that section 7.6 "Metadata template" of the CQPwebAdminManual 
> is empty. Could you show me a template?
>
> Best,
> Jiayue
>
> On 14/11/16 09:38, Hardie, Andrew wrote:
>
> Well it looks rather as if you don't have any text tags at all 
> there... which would be part of the problem. Try again with <text 
> id="...">...</text> tags added to the file, as required.
>
> As for why indexing is taking so long, it's very difficult for me to 
> diagnose at a distance. You should keep an eye on your process list 
> (e.g. via top) to see if anything is actually happening. As long as a 
> cwb-*** process is running, something productive is happening, and you 
> shouldn't abort.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <mailto:cwb-bounces at sslmit.unibo.it> 
> [mailto:cwb-bounces at sslmit.unibo.it 
> <mailto:cwb-bounces at sslmit.unibo.it>] On Behalf Of Jiayue Wang
> Sent: 13 November 2016 11:06
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Can't create metadata
>
> Hi Andrew,
>
> Thanks a lot.
> I deleted the us_rhodeisland corpus and tried again to install it. The
> corpus file looks like this:
>
> If      IN      if
> you     PP      you
> have    VBP     have
> any     DT      any
> questions       NNS     question
> or      CC      or
> suggestions     NNS     suggestion
> how     WRB     how
> this    DT      this
> website NN      website
> might   MD      might
> be      VB      be
> improved        VBN     improve
> ,       ,       ,
> please  VB      please
> feel    VB      feel
> free    JJ      free
> to      TO      to
> contact VB      contact
> us      PP      us
> .       SENT    .
>
> The corpus contains only this file (44.0 MB). For the P-attributes I selected
> the POS and lemma (TreeTagger format) option. When I clicked Install, 31
> files were created in the index/us_rhodeisland folder, but the process
> went on endlessly. I interrupted it and tried again, but the same thing
> happened. Roughly how long should this take on my laptop, which has
> 8 GB of RAM and an Intel i5 quad-core CPU?
>
> Best
> Jiayue
>
> On 13/11/16 06:19, Hardie, Andrew wrote:
>
> This error message suggests that your <text> elements lack valid ID
> codes.
>
> The most likely reason for [UNREADABLE] is that you have declared a
> primary annotation, e.g. a part of speech tag, but the annotation in
> question does not exist. This can happen if you use a template that
> your data does not match, for instance.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <mailto:cwb-bounces at sslmit.unibo.it> 
> [mailto:cwb-bounces at sslmit.unibo.it 
> <mailto:cwb-bounces at sslmit.unibo.it>] On Behalf Of Jiayue Wang
> Sent: 11 November 2016 20:17
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] Can't create metadata
>
> Hi,
>
> After a full installation of CQPweb I installed a corpus called
> "us_rhodeisland" (including 2 files, a raw text, and a TreeTagger
> tagged text) without metadata. Since I have no idea what a metadata
> file looks like, I selected "No thanks, I'll run this myself (safer
> for very large corpora)" and clicked "Create minimalist metadata
> table" and saw the following error message:
>
>
> A MySQL query did not run successfully!
>
>
> Original query: insert into
> ___temp_cqp_text_positions_for_us_rhodeisland (text_id, cqp_begin,
> cqp_end) VALUES ('', 0, 55858),('', 55859, 3058358) /* from User:
> admin | Function: do_append_mysql_comment() | 2016-Nov-11 20:04:20
> */
>
>
> Error # 1062: Duplicate entry '' for key 'PRIMARY'
>
>
> BTW, when I try a standard query, each concordance line begins with
> "[UNREADABLE] [UNREADABLE] [UNREADABLE]". What is the most likely
> reason?
>
> Any help is appreciated, thanks!
>
> Jiayue Wang
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it <mailto:CWB at sslmit.unibo.it>
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
