[CWB] Adding metadata to corpus via CQPWeb
Hardie, Andrew
a.hardie at lancaster.ac.uk
Wed Oct 17 00:16:35 CEST 2012
Hi Martí,
There is a bit of a misunderstanding here concerning what is meant by "metadata" from CQPweb's perspective. From this perspective "metadata" means "information regarding each text, held separately". It does not mean "The information coded in the XML" - except that, of course, texts have to have the text_id attribute to link them to the metadata rows.
So - because the code attribute is on the lang element, it *does not* form part of your metadata.
CQPweb's awareness of XML currently leaves rather a lot to be desired, as XML-element-level data cannot currently be added in. When I get a reasonable chunk of programming time, I intend to address this.
So you do not actually have any "metadata" in the relevant sense. You should just use the "create minimalist metadata" option.
For reference, if you ever do install a corpus with text-level metadata:
- handle? lang or lang_code?
--> The handle can be whatever you like.
- description (a free description or the way it will appear in the search/query interface)
--> Both: it's a free description which will appear in the interface.
- classification or free text is clear (but where do I declare my classifications?)
--> You don't declare your classifications. They are deduced from the file input.
- how should I decide which is the primary field (I would say it is text_id, which apparently is default)
--> It's whichever classification you want to appear most prominently in the interface (NOT text_id, because that's not a classification field)
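As a rough illustration only (this is not something CQPweb provides), a text-level metadata file of the kind described in the original message, with text_id in the first column and one tab-separated line per text, could be generated from the vertical-format corpus along these lines. The "genre" field, its values, and the function names are all hypothetical; the sample lines are a shortened stand-in for the real corpus.

```python
import re

# Matches an opening <text ...> tag carrying an id attribute, as in the sample corpus.
TEXT_TAG = re.compile(r'<text\s+[^>]*\bid="([^"]+)"')


def collect_text_ids(vertical_lines):
    """Return the text IDs in order of first appearance."""
    ids = []
    seen = set()
    for line in vertical_lines:
        match = TEXT_TAG.match(line.strip())
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            ids.append(match.group(1))
    return ids


def build_metadata_rows(text_ids, genre_by_id, default="unknown"):
    """One tab-separated row per text: text_id first, then one classification value."""
    return ["\t".join([tid, genre_by_id.get(tid, default)]) for tid in text_ids]


# Shortened stand-in for a vertical corpus file (hypothetical content).
sample = [
    '<text id="AF002">',
    'buenas\tbueno\tADJ',
    '<lang code="en">',
    'ok\tok\tVV',
    '</lang>',
    '</text>',
    '<text id="AF003">',
    'hola\thola\tNC',
    '</text>',
]

# Hypothetical text-level classification; the values are invented for illustration.
genres = {"AF002": "interview"}

text_ids = collect_text_ids(sample)
rows = build_metadata_rows(text_ids, genres)
print("\n".join(rows))
```

Note that the lang elements are skipped on purpose: only attributes that describe a whole text belong in this file.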
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Martí Quixal
Sent: 16 October 2012 02:43
To: cwb at sslmit.unibo.it
Subject: [CWB] Adding metadata to corpus via CQPWeb
Hi all,
I just installed my first corpus, but I did not manage to associate metadata with it. The instructions say I should prepare a separate file for the metadata where the first column contains the text_id (one line per text). My corpus has several texts, with different text ids.
Currently I am only using two different types of metadata (I am playing around still)
text has the attribute id, which holds an id like AF002, etc.
lang has the attribute code, which currently can only be en (but I foresee that it can have en, es, fr...)
What should my metadata file look like? Like this? (I write \tab because I cannot use tabs)
AF002 \tab en
AF003 \tab en
AF004 \tab en
AF006 \tab en
...
That sounds a bit weird.
The other thing is I don't quite understand the terminology used in the form to add metadata in the CQPWeb interface:
- handle? lang or lang_code?
- description (a free description or the way it will appear in the search/query interface)
- classification or free text is clear (but where do I declare my classifications?)
- how should I decide which is the primary field (I would say it is text_id, which apparently is default)
Just for info, the corpus I am testing the process with looks like this:
<text id="AF002">
buenas bueno ADJ
tardes tarde NC
estamos estar VEfin
aquí aquí ADV
con con PREP
X X NC
gracias gracia NC
por por PREP
hacer hacer VLinf
esta este DM
entrevista entrevistar VLfin
laura laura NC
cuándo cuándo ADV
y y CC
dónde dónde ADV
naciste nacer VLfin
<lang code="en">
ok ok VV
um um RB
</lang>
nací nacer VLfin
en en PREP
1988 @card@ CARD
este este DM
nací nacer VLfin
aquí aquí ADV
en en PREP
el el ART
paso paso NC
<lang code="en">
texas texas NN
</lang>
en en PREP
octubre octubre NMON
(...)
Best regards,
Martí
--
Martí Quixal
Computational Linguist & Educational Technologist http://www.iqubo.org/quixal