[CWB] Can't create metadata

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Nov 14 16:33:50 CET 2016


You can't nest <text> elements!

If you want to delineate sub-text units, use some other tag: e.g. <section type="XXX"> or something like that.
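
A rough sketch, in Python, of how one might scan a vertical (one-token-per-line) file for nested <text> elements before indexing. The name find_nested_text and the sample lines are made up for illustration; nothing here is part of CWB or CQPweb, and it assumes each tag sits on a line of its own.

```python
import re

# Opening <text> tag on a line of its own, with or without attributes.
TEXT_OPEN = re.compile(r"^<text(\s[^>]*)?>$")
TEXT_CLOSE = "</text>"


def find_nested_text(lines):
    """Return 1-based line numbers of <text> tags opened inside another <text>."""
    depth = 0
    nested = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if TEXT_OPEN.match(line):
            if depth > 0:
                nested.append(lineno)
            depth += 1
        elif line == TEXT_CLOSE and depth > 0:
            depth -= 1
    return nested


sample = [
    '<text id="book1">',
    '<text id="chapter1">',      # nested <text>: not allowed
    "word\tNN\tword",
    "</text>",
    "</text>",
    '<text id="book2">',
    '<section type="chapter">',  # a different tag inside <text>: fine
    "word\tNN\tword",
    "</section>",
    "</text>",
]
print(find_nested_text(sample))
```

Any line number it reports is a <text> that would need to become some other tag, such as <section>.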

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
Sent: 14 November 2016 15:27
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Can't create metadata

Thanks Hannah. Do you mean in those corpora both whole texts and their sections were enclosed between tags, something like <text><text>...</text><text>...</text></text>?

On 14/11/16 14:20, Hannah Kermes wrote:
> Hi Jiayue,
>
> the <text> elements are used for the built-in distribution of CQPweb, 
> so it makes sense to ask yourself what is most usefully enclosed in 
> these elements.
>
> Usually, you will enclose every text in your corpus in a separate 
> <text> element; this could be articles, essays, or whole books, depending 
> on what your corpus consists of. But we have also had corpora where we 
> enclosed smaller units, e.g. chapters of a book or utterances, in 
> <text> elements to be able to use the built-in distribution.
>
> The metadata allow you to group the texts into different subcorpora (e.g.
> author_sex, year, register, genre). Each column (in the tab-delimited
> file) or each attribute in the <text> element stands for a different 
> set of subcorpora (author_sex: male, female; register: academic, news, 
> ...)
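
To make the grouping idea concrete, here is a small Python sketch that reads a tab-delimited metadata table and collects the text IDs belonging to each category. The table layout (text ID in the first column, one classification per further column, no header row), the column names and the sample values are all assumptions made for illustration; check the CQPweb documentation for the exact file format it expects.

```python
from collections import defaultdict

# Hypothetical metadata table: text ID first, then one column per classification.
METADATA = "ST1\tfemale\tacademic\nST2\tmale\tnews\nST3\tfemale\tnews\n"
COLUMNS = ["author_sex", "register"]  # assumed column names


def subcorpora(metadata, columns):
    """Map each column name to {category value: [text IDs]}."""
    groups = {name: defaultdict(list) for name in columns}
    for row in metadata.splitlines():
        text_id, *values = row.split("\t")
        for name, value in zip(columns, values):
            groups[name][value].append(text_id)
    # Convert the inner defaultdicts to plain dicts for easier inspection.
    return {name: dict(by_value) for name, by_value in groups.items()}


result = subcorpora(METADATA, COLUMNS)
print(result["author_sex"])
print(result["register"])
```

Each column yields its own partition of the same texts, which is the sense in which every column stands for a different set of subcorpora.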
>
> Best
>
> Hannah
>
>
> On 14.11.2016 at 15:08, Hardie, Andrew wrote:
>>
>> Daniel's sample of a datafile exemplifies one of the two methods for 
>> more-than-minimal text metadata. This can either be loaded from a 
>> tab-delimited file, or deduced from XML. The latter method is the one 
>> Daniel exemplifies.
>>
>>
>>
>> For /minimal/ metadata you only require <text> elements with an id 
>> attribute (whose values must be /handles/, i.e. just letters and 
>> numbers, with no spaces or punctuation).
>>
>>
>>
>> It is a rule of CQPweb corpora that the whole corpus needs to occur 
>> within <text> elements, each of which must have an id, and there 
>> can't be any words that are not inside a <text> element. If you don't 
>> care about text boundaries, you can just wrap the whole corpus in one 
>> <text id="CORPUS"> ... </text>
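
The rule above (no tokens outside <text>, every <text> with an id that is a handle) can be checked before indexing. A rough Python sketch follows; check_text_elements and the sample lines are made up for illustration. The uniqueness check is an added assumption, motivated by the "Duplicate entry" error reported earlier in this thread, and any line starting with "<" is naively treated as a tag.

```python
import re

TEXT_OPEN = re.compile(r"^<text(\s[^>]*)?>$")
ID_ATTR = re.compile(r'\bid="([^"]*)"')
HANDLE = re.compile(r"^[A-Za-z0-9]+$")  # letters and digits only, as described above


def check_text_elements(lines):
    """Return (line number, problem) pairs found in a vertical file."""
    problems = []
    seen_ids = set()
    inside_text = False
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if TEXT_OPEN.match(line):
            inside_text = True
            found = ID_ATTR.search(line)
            text_id = found.group(1) if found else ""
            if not HANDLE.match(text_id):
                problems.append((lineno, "missing or invalid id"))
            elif text_id in seen_ids:
                problems.append((lineno, "duplicate id"))
            seen_ids.add(text_id)
        elif line == "</text>":
            inside_text = False
        elif not line.startswith("<") and not inside_text:
            problems.append((lineno, "token outside <text>"))
    return problems


sample = [
    "stray\tNN\tstray",   # token before any <text>
    '<text id="ST1">',
    "word\tNN\tword",
    "</text>",
    '<text id="ST 2">',   # space in the id: not a handle
    "word\tNN\tword",
    "</text>",
    '<text id="ST1">',    # id already used
    "word\tNN\tword",
    "</text>",
]
for problem in check_text_elements(sample):
    print(problem)
```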
>>
>>
>>
>> This is explained in my paper:
>>
>>
>>
>>   * Hardie, Andrew (2012). CQPweb - combining power, flexibility and
>>     usability in a corpus analysis tool. /International Journal of
>>     Corpus Linguistics/ 17 (3): 380-409.
>>     <http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004>
>>     [alternative link]
>>     <http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf>
>>
>>
>>
>> Sorry it's not written up in the manual yet, only so many hours in a 
>> day alas.
>>
>>
>>
>> best
>>
>>
>>
>> Andrew.
>>
>>
>>
>>
>>
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Daniel Renau
>> Sent: 14 November 2016 13:46
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] Can't create metadata
>>
>>
>>
>> Hi Jiayue,
>>
>> My team works with verticalized texts like this:
>>
>> <text id="ST1" title="namewithoutspaces" author="name">
>> <s>
>> word    pos     lemma
>> word    pos     lemma
>> word    pos     lemma
>> word    pos     lemma
>> </s>
>> </text>
>>
>> <text id="ST2" title="anothertextname" author="otherperson">
>> <s>
>> word    pos     lemma
>> word    pos     lemma
>> word    pos     lemma
>> </s>
>> </text>
>>
>> You can add more attributes to the <text> tag, such as author_sex, language, year, translator...
>>
>>
>>
>> On 14 Nov 2016 at 2:37 p.m., "Jiayue Wang" <arthur0421 at gmail.com> wrote:
>>
>> Thanks Andrew. I still don't understand where the tags <text id=""> 
>> and </text> should be added. Should they enclose a corpus file? I 
>> notice that section 7.6 "Metadata template" of the CQPwebAdminManual 
>> is empty. Could you show me a template?
>>
>> Best,
>> Jiayue
>>
>> On 14/11/16 09:38, Hardie, Andrew wrote:
>>
>> Well it looks rather as if you don't have any text tags at all 
>> there... which would be part of the problem. Try again with <text 
>> id="...">...</text> tags added to the file, as required.
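
For a file with no <text> tags at all, the simplest fix is to enclose everything in one element. A minimal Python sketch, assuming the corpus is a plain list of token lines; wrap_in_text is a hypothetical helper and the default id "CORPUS" is only an example value.

```python
import re


def wrap_in_text(lines, text_id="CORPUS"):
    """Enclose all lines in a single <text id="..."> ... </text> element."""
    # The id must be a handle: letters and digits only.
    if not re.fullmatch(r"[A-Za-z0-9]+", text_id):
        raise ValueError(f"not a valid handle: {text_id!r}")
    body = [line.rstrip("\n") for line in lines]
    return [f'<text id="{text_id}">'] + body + ["</text>"]


tokens = ["If\tIN\tif\n", "you\tPP\tyou\n", "have\tVBP\thave\n"]
wrapped = wrap_in_text(tokens)
print("\n".join(wrapped))
```

To keep text boundaries instead, the same idea applies per text, with a distinct id for each one.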
>>
>> As for why indexing is taking so long, it's very difficult for me to 
>> diagnose at a distance. You should keep an eye on your process list 
>> (e.g. via top) to see if anything is actually happening. As long as a
>> cwb-*** process is running, something productive is happening, and 
>> you shouldn't abort.
>>
>> best
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
>> Sent: 13 November 2016 11:06
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] Can't create metadata
>>
>> Hi Andrew,
>>
>> Thanks a lot.
>> I deleted the us_rhodeisland corpus and tried again to install it. 
>> The corpus file looks like this:
>>
>> If      IN      if
>> you     PP      you
>> have    VBP     have
>> any     DT      any
>> questions       NNS     question
>> or      CC      or
>> suggestions     NNS     suggestion
>> how     WRB     how
>> this    DT      this
>> website NN      website
>> might   MD      might
>> be      VB      be
>> improved        VBN     improve
>> ,       ,       ,
>> please  VB      please
>> feel    VB      feel
>> free    JJ      free
>> to      TO      to
>> contact VB      contact
>> us      PP      us
>> .       SENT    .
>>
>> The corpus contains only this file (44.0 MB). For the p-attributes I 
>> selected the "POS and lemma (TreeTagger format)" option. When I clicked 
>> Install, 31 files were created in the index/us_rhodeisland folder, 
>> but the process went on endlessly. I interrupted it and tried again, 
>> but the same thing happened. Roughly how long should this take on my 
>> laptop, which has 8 GB of RAM and a quad-core Intel i5 CPU?
>>
>> Best
>> Jiayue
>>
>> On 13/11/16 06:19, Hardie, Andrew wrote:
>>
>> This error message suggests that your <text> elements lack valid ID 
>> codes.
>>
>> The most likely reason for [UNREADABLE] is that you have declared a 
>> primary annotation, e.g. a part of speech tag, but the annotation in 
>> question does not exist. This can happen if you use a template that 
>> your data does not match, for instance.
>>
>> best
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
>> Sent: 11 November 2016 20:17
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] Can't create metadata
>>
>> Hi,
>>
>> After a full installation of CQPweb I installed a corpus called 
>> "us_rhodeisland" (including 2 files, a raw text, and a TreeTagger 
>> tagged text) without metadata. Since I have no idea what a metadata 
>> file looks like, I selected "No thanks, I'll run this myself (safer 
>> for very large corpora)" and clicked "Create minimalist metadata 
>> table" and saw the following error message:
>>
>>
>> A MySQL query did not run successfully!
>>
>>
>> Original query: insert into
>> ___temp_cqp_text_positions_for_us_rhodeisland (text_id, cqp_begin,
>> cqp_end) VALUES ('', 0, 55858),('', 55859, 3058358) /* from User:
>> admin | Function: do_append_mysql_comment() | 2016-Nov-11 20:04:20 */
>>
>>
>> Error # 1062: Duplicate entry '' for key 'PRIMARY'
>>
>>
>> BTW, when I try a standard query, each concordance line begins with
>> "[UNREADABLE] [UNREADABLE] [UNREADABLE]". What is the most likely
>> reason?
>>
>> Any help is appreciated, thanks!
>>
>> Jiayue Wang
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb