[CWB] Can't create metadata

Hannah Kermes h.kermes at mx.uni-saarland.de
Mon Nov 14 16:46:30 CET 2016


As Andrew said, you can't nest <text> elements. Where we labeled 
smaller units as <text>, the larger units were not enclosed in <text> 
elements; instead, we used an attribute to mark the elements 
belonging to the same larger unit.

But to begin with, it is easier to stick to whole texts as <text> elements.
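
A minimal sketch of the attribute approach in Python, assuming chapters are the <text> units. The function name, the `book` attribute and the ID scheme are hypothetical illustrations, not CQPweb conventions:

```python
from xml.sax.saxutils import quoteattr


def chapters_as_texts(book_id, chapters):
    """Return vertical-format lines: one <text> per chapter, never nested.

    `chapters` is a list of chapters, each a list of (word, pos, lemma)
    tuples. The `book` attribute is a hypothetical name for the attribute
    that marks which larger unit a chapter belongs to.
    """
    lines = []
    for number, tokens in enumerate(chapters, start=1):
        text_id = f"{book_id}_ch{number}"  # hypothetical ID scheme
        lines.append(f"<text id={quoteattr(text_id)} book={quoteattr(book_id)}>")
        lines.extend("\t".join(token) for token in tokens)
        lines.append("</text>")
    return lines


demo = chapters_as_texts("B1", [[("Hello", "UH", "hello")], [("Bye", "UH", "bye")]])
print("\n".join(demo))
```

Each chapter gets its own ID, and the shared `book` value is what would let the chapters of one book be grouped together later.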

Ciao, ciao

Hannah


On 14.11.2016 at 16:33, Hardie, Andrew wrote:
> You can't nest <text> elements!
>
> If you want to delineate sub-text units, use some other tag: e.g. <section type="XXX"> or something like that.
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
> Sent: 14 November 2016 15:27
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Can't create metadata
>
> Thanks Hannah. Do you mean in those corpora both whole texts and their sections were enclosed between tags, something like <text><text>...</text><text>...</text></text>?
>
> On 14/11/16 14:20, Hannah Kermes wrote:
>> Hi Jiayue,
>>
>> the <text>-elements are used for the built-in distribution of CQPweb,
>> so it makes sense to ask yourself what is most usefully enclosed in
>> these elements.
>>
>> Usually, you will enclose every text in your corpus in a separate
>> <text>-element; these could be articles, essays or whole books, depending
>> on what your corpus consists of. But we also had corpora where we
>> enclosed smaller units, e.g. chapters of a book or utterances, in
>> <text>-elements to be able to use the built-in distribution.
>>
>> The metadata allow you to group the texts into different subcorpora (e.g.
>> by author_sex, year, register, genre). Each column (in the tab-delimited
>> file) or each attribute in the <text>-element stands for a different
>> set of subcorpora (author_sex: male, female; register: academic, news,
>> ...)
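
A tab-delimited metadata file of the kind described above could be produced roughly as follows. This is only a sketch: it assumes that the text ID goes in the first column and that there is no header row, and the IDs and values are made up for illustration.

```python
import csv
import io


def write_metadata(rows, columns, out):
    """Write one tab-delimited line per text: the ID first, then each column."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row["id"]] + [row[name] for name in columns])


# Hypothetical texts; every column defines one set of subcorpora.
texts = [
    {"id": "ST1", "author_sex": "female", "register": "academic"},
    {"id": "ST2", "author_sex": "male", "register": "news"},
]

buffer = io.StringIO()
write_metadata(texts, ["author_sex", "register"], buffer)
print(buffer.getvalue(), end="")
```

In real use `out` would be an open file rather than a StringIO buffer.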
>>
>> Best
>>
>> Hannah
>>
>>
>> On 14.11.2016 at 15:08, Hardie, Andrew wrote:
>>> Daniel's sample of a datafile exemplifies one of the two methods for
>>> more-than-minimal text metadata. This can either be loaded from a
>>> tab-delimited file, or deduced from XML. The latter method is the one
>>> Daniel exemplifies.
>>>
>>>
>>>
>>> For /minimal/ metadata you only require <text> elements with an id
>>> attribute (whose values must be /handles/, i.e. just letters and
>>> numbers, with no spaces or punctuation).
>>>
>>>
>>>
>>> It is a rule of CQPweb corpora that the whole corpus needs to occur
>>> within <text> elements, each of which must have an id, and there
>>> can't be any words that are not inside a <text> element. If you don't
>>> care about text boundaries, you can just wrap the whole corpus in one
>>> <text id="CORPUS"> . </text>
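
For a file that has no text boundaries at all, the single-wrapper option could be sketched like this in Python. The function name is hypothetical, and the input is assumed to be an iterable of token lines such as TreeTagger output:

```python
def wrap_in_single_text(lines, text_id="CORPUS"):
    """Yield the token lines enclosed in one <text id="..."> ... </text> pair."""
    yield f'<text id="{text_id}">'
    for line in lines:
        yield line.rstrip("\n")
    yield "</text>"


sample = ["If\tIN\tif\n", "you\tPP\tyou\n"]
print("\n".join(wrap_in_single_text(sample)))
```

Because the function only iterates over lines, it can be fed an open file object directly, so a large corpus file need not be read into memory at once.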
>>>
>>>
>>>
>>> This is explained in my paper:
>>>
>>>
>>>
>>>    * Hardie, Andrew (2012). CQPweb - combining power, flexibility and
>>>      usability in a corpus analysis tool
>>>      <http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004>. /International
>>>      Journal of Corpus Linguistics/ 17 (3): 380-409. [alternative link]
>>>      <http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf>
>>>
>>>
>>>
>>> Sorry it's not written up in the manual yet, only so many hours in a
>>> day alas.
>>>
>>>
>>>
>>> best
>>>
>>>
>>>
>>> Andrew.
>>>
>>>
>>>
>>>
>>>
>>> From: cwb-bounces at sslmit.unibo.it
>>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Daniel Renau
>>> Sent: 14 November 2016 13:46
>>> To: Open source development of the Corpus WorkBench
>>> Subject: Re: [CWB] Can't create metadata
>>>
>>>
>>>
>>> Hi Jiayue,
>>>
>>> My team works with verticalized texts like this:
>>>
>>> <text id="ST1" title="namewithoutspaces" author="name">
>>> <s>
>>> word    pos     lemma
>>> word    pos     lemma
>>> word    pos     lemma
>>> word    pos     lemma
>>> </s>
>>> </text>
>>>
>>> <text id="ST2" title="anothertextname" author="otherperson">
>>> <s>
>>> word    pos     lemma
>>> word    pos     lemma
>>> word    pos     lemma
>>> </s>
>>> </text>
>>>
>>> You can add more attributes to the <text> tag, such as author_sex, language, year, translator...
>>>
>>>
>>>
>>> On 14 Nov 2016 at 2:37 p.m., "Jiayue Wang" <arthur0421 at gmail.com> wrote:
>>>
>>> Thanks Andrew. I still don't understand where the tags <text id="">
>>> and </text> should be added. Should they enclose a corpus file? I
>>> notice that section 7.6 "Metadata template" of the CQPwebAdminManual
>>> is empty. Could you show me a template?
>>>
>>> Best,
>>> Jiayue
>>>
>>> On 14/11/16 09:38, Hardie, Andrew wrote:
>>>
>>> Well it looks rather as if you don't have any text tags at all
>>> there... which would be part of the problem. Try again with <text
>>> id="...">...</text> tags added to the file, as required.
>>>
>>> As for why indexing is taking so long, it's very difficult for me to
>>> diagnose at a distance. You should keep an eye on your process list
>>> (e.g. via top) to see if anything is actually happening. As long as a
>>> cwb-*** process is running, something productive is happening, and
>>> you shouldn't abort.
>>>
>>> best
>>>
>>> Andrew.
>>>
>>> -----Original Message-----
>>> From: cwb-bounces at sslmit.unibo.it
>>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
>>> Sent: 13 November 2016 11:06
>>> To: Open source development of the Corpus WorkBench
>>> Subject: Re: [CWB] Can't create metadata
>>>
>>> Hi Andrew,
>>>
>>> Thanks a lot.
>>> I deleted the us_rhodeisland corpus and tried again to install it.
>>> The corpus file looks like this:
>>>
>>> If      IN      if
>>> you     PP      you
>>> have    VBP     have
>>> any     DT      any
>>> questions       NNS     question
>>> or      CC      or
>>> suggestions     NNS     suggestion
>>> how     WRB     how
>>> this    DT      this
>>> website NN      website
>>> might   MD      might
>>> be      VB      be
>>> improved        VBN     improve
>>> ,       ,       ,
>>> please  VB      please
>>> feel    VB      feel
>>> free    JJ      free
>>> to      TO      to
>>> contact VB      contact
>>> us      PP      us
>>> .       SENT    .
>>>
>>> The corpus contains only this file (44.0 MB). For the P-attributes I
>>> selected the "POS and lemma (TreeTagger format)" option. Then I clicked
>>> Install; 31 files were created in the index/us_rhodeisland folder,
>>> but the process went on endlessly. I interrupted it and tried again,
>>> but the same thing happened. Roughly how long should this take on my
>>> laptop, which has 8 GB of RAM and an Intel i5 quad-core CPU?
>>>
>>> Best
>>> Jiayue
>>>
>>> On 13/11/16 06:19, Hardie, Andrew wrote:
>>>
>>> This error message suggests that your <text> elements lack valid ID
>>> codes.
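
One way to check a vertical file for this before installing it is sketched below in Python. It assumes each <text ...> tag sits on its own line, that a handle may also contain underscores (an assumption beyond "letters and numbers"), and it treats any line starting with "<" as markup, which is a simplification. All names are hypothetical.

```python
import re

HANDLE = re.compile(r"[A-Za-z0-9_]+")  # allowing "_" is an assumption
TEXT_OPEN = re.compile(r"<text(\s[^>]*)?>")
ID_ATTR = re.compile(r'\sid="([^"]*)"')


def check_text_elements(lines):
    """Return a list of problems found in the <text> markup of a vertical file."""
    problems, seen, inside = [], set(), False
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        opening = TEXT_OPEN.fullmatch(line)
        if opening:
            if inside:
                problems.append(f"line {number}: nested <text>")
            inside = True
            found = ID_ATTR.search(opening.group(1) or "")
            text_id = found.group(1) if found else ""
            if not HANDLE.fullmatch(text_id):
                problems.append(f"line {number}: missing or invalid id {text_id!r}")
            elif text_id in seen:
                problems.append(f"line {number}: duplicate id {text_id!r}")
            seen.add(text_id)
        elif line == "</text>":
            inside = False
        elif not line.startswith("<") and not inside:
            problems.append(f"line {number}: token outside <text>")
    return problems


print(check_text_elements(["If\tIN\tif", "<text>", "you\tPP\tyou", "</text>"]))
```

An empty result means none of these particular problems were found; it does not guarantee that the corpus will index correctly.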
>>>
>>> The most likely reason for [UNREADABLE] is that you have declared a
>>> primary annotation, e.g. a part of speech tag, but the annotation in
>>> question does not exist. This can happen if you use a template that
>>> your data does not match, for instance.
>>>
>>> best
>>>
>>> Andrew.
>>>
>>> -----Original Message-----
>>> From: cwb-bounces at sslmit.unibo.it
>>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
>>> Sent: 11 November 2016 20:17
>>> To: Open source development of the Corpus WorkBench
>>> Subject: [CWB] Can't create metadata
>>>
>>> Hi,
>>>
>>> After a full installation of CQPweb I installed a corpus called
>>> "us_rhodeisland" (including 2 files, a raw text, and a TreeTagger
>>> tagged text) without metadata. Since I have no idea what a metadata
>>> file looks like, I selected "No thanks, I'll run this myself (safer
>>> for very large corpora)" and clicked "Create minimalist metadata
>>> table" and saw the following error message:
>>>
>>>
>>> A MySQL query did not run successfully!
>>>
>>>
>>> Original query: insert into
>>> ___temp_cqp_text_positions_for_us_rhodeisland (text_id, cqp_begin,
>>> cqp_end) VALUES ('', 0, 55858),('', 55859, 3058358) /* from User:
>>> admin | Function: do_append_mysql_comment() | 2016-Nov-11 20:04:20 */
>>>
>>>
>>> Error # 1062: Duplicate entry '' for key 'PRIMARY'
>>>
>>>
>>> BTW, when I try a standard query, each concordance line begins with
>>> "[UNREADABLE] [UNREADABLE] [UNREADABLE]". What is the most likely
>>> reason?
>>>
>>> Any help is appreciated, thanks!
>>>
>>> Jiayue Wang
>>> _______________________________________________
>>> CWB mailing list
>>> CWB at sslmit.unibo.it
>>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb