[CWB] Adding metadata to corpus via CQPweb

Wed Oct 17 00:16:35 CEST 2012

Hi Martí,

There is a bit of a misunderstanding here concerning what is meant by "metadata" from CQPweb's perspective. From this perspective "metadata" means "information regarding each text, held separately". It does not mean "The information coded in the XML" - except that, of course, texts have to have the text_id attribute to link them to the metadata rows.

So - because the code attribute is on the lang element, it *does not* form part of your metadata. 
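
To make the distinction concrete, here is a minimal Python sketch (not part of CQPweb; the function name and regular expressions are my own, and the closing </text> tag is assumed because the excerpt is cut off before it). It walks a vertical file like yours and records which lang codes are opened inside each text. A single text can contain several lang regions, which is why the code attribute describes stretches within a text rather than the text as a whole.

```python
import re
from collections import defaultdict

# Patterns for the two kinds of opening tags seen in the sample corpus.
TEXT_OPEN = re.compile(r'<text\s+id="([^"]+)"\s*>')
LANG_OPEN = re.compile(r'<lang\s+code="([^"]+)"\s*>')

def lang_regions_per_text(lines):
    """Map each text id to the list of lang codes opened inside that text."""
    regions = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        m = TEXT_OPEN.match(line)
        if m:
            current = m.group(1)
            regions[current]  # register the text even if it has no lang regions
            continue
        if line == "</text>":
            current = None
            continue
        m = LANG_OPEN.match(line)
        if m and current is not None:
            regions[current].append(m.group(1))
    return dict(regions)

# Shortened version of the sample; </text> is assumed, not shown in the excerpt.
sample = """<text id="AF002">
buenas\tbueno\tADJ
<lang code="en">
ok\tok\tVV
</lang>
nací\tnacer\tVLfin
<lang code="en">
texas\ttexas\tNN
</lang>
</text>""".splitlines()

print(lang_regions_per_text(sample))
```

Here AF002 ends up with two separate "en" regions, so there is no single value of code that could go into a one-row-per-text table.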

CQPweb's awareness of XML currently leaves rather a lot to be desired, as XML-element-level data cannot currently be added in. When I get a reasonable chunk of programming time, I intend to address this.

So you do not actually have any "metadata" in the relevant sense. You should just use the "create minimalist metadata" option.

For reference, if you ever do install a corpus with text-level metadata:

- handle? lang or lang_code?
--> The handle can be whatever you like.
- description (a free description or the way it will appear in the search/query interface)
--> Both: it's a free description which will appear in the interface.
- classification or free text is clear (but where do I declare my classifications?)
--> You don't declare your classifications. They are deduced from the file input.
- how should I decide which is the primary field (I would say it is text_id, which apparently is default)
--> It's whichever classification you want to appear most prominently in the interface (NOT text_id, because that's not a classification field)
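
As a rough illustration of the points above, the following Python sketch writes a text-level metadata file and shows how the categories of a classification field fall out of the file itself. It assumes a tab-delimited layout with text_id in the first column and one line per text, as in your example; the two extra fields (a genre and a speaker sex) and all their values are hypothetical, and the helper names are mine rather than anything CQPweb provides.

```python
import csv
import io

def write_metadata(rows, out):
    """Write one tab-delimited line per text: text_id first, then the other fields."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for text_id, fields in rows:
        writer.writerow([text_id, *fields])

def categories(rows, column):
    """Distinct values in one field: nothing is declared, they come from the data."""
    return sorted({fields[column] for _, fields in rows})

# Text ids are taken from the question; genre and sex values are made up.
rows = [
    ("AF002", ["interview", "f"]),
    ("AF003", ["interview", "m"]),
    ("AF004", ["monologue", "f"]),
    ("AF006", ["interview", "f"]),
]

buf = io.StringIO()
write_metadata(rows, buf)
print(buf.getvalue(), end="")
print(categories(rows, 0))
print(categories(rows, 1))
```

With a file like this, either hypothetical field could serve as the primary classification, but text_id could not, since it only identifies the row.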

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Martí Quixal
Sent: 16 October 2012 02:43
To: cwb at sslmit.unibo.it
Subject: [CWB] Adding metada to corpus via CQPWeb

Hi all,

I just installed my first corpus, but I did not manage to associate metadata with it. The instructions say I should prepare a separate file for the metadata in which the first column contains the text_id (one line per text). My corpus has several texts, with different text ids.

Currently I am only using two different types of metadata (I am still playing around):

text has the attribute id, which holds an id like AF002, etc.

lang has the attribute code, which currently can only be en (but I foresee that it can have en, es, fr...)

What should my metadata file look like? Like this? (I write \tab because I cannot use tabs)

AF002 \tab en
AF003 \tab en
AF004 \tab en
AF006 \tab en
...

That sounds a bit weird.

The other thing is I don't quite understand the terminology used in the form to add metadata in the CQPWeb interface:

- handle? lang or lang_code?
- description (a free description or the way it will appear in the search/query interface)
- classification or free text is clear (but where do I declare my classifications?)
- how should I decide which is the primary field (I would say it is text_id, which apparently is default)

Just for info, the corpus I am testing the process with looks like this:

<text id="AF002">
buenas  bueno   ADJ
tardes  tarde   NC
estamos estar   VEfin
aquí    aquí    ADV
con     con     PREP
X    X   NC
gracias gracia  NC
por     por     PREP
hacer   hacer   VLinf
esta    este    DM
entrevista      entrevistar     VLfin
laura   laura   NC
cuándo  cuándo  ADV
y       y       CC
dónde   dónde   ADV
naciste nacer   VLfin
<lang code="en">
ok      ok      VV
um      um      RB
</lang>
nací    nacer   VLfin
en      en      PREP
1988    @card@  CARD
este    este    DM
nací    nacer   VLfin
aquí    aquí    ADV
en      en      PREP
el      el      ART
paso    paso    NC
<lang code="en">
texas   texas   NN
</lang>
en      en      PREP
octubre octubre NMON
(...)

Best regards,
Martí

--
Martí Quixal
Computational Linguist & Educational Technologist
http://www.iqubo.org/quixal
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb