[CWB] Example of metadata file?

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Jan 10 15:24:43 CET 2017


I have just spent an hour or two expanding the discussion of metadata in the admin manual, and updated it in SVN and on the website.

http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf 

Chapters 6/7 are now slightly more complete than they were, though there is still plenty to do (as there always is).

One thing still not covered extensively is the ID-link datatype. So here are some notes on that.

The idea is that this offers a layer of *indirection* for the XML. 

The paradigmatic case of an ID link is speaker metadata.

In a spoken corpus you have lots of utterances (<u>) and you very often want to do operations within certain utterances and not others based on features of the speakers. EG you might want to search only within speech by males, or by people in a particular age group.

You COULD add XML attributes for each of these things ie

<u speaker_age="12" speaker_sex="male">Hello!</u>

But this is not a terribly good design, because age/sex are not features of UTTERANCES, they are features of SPEAKERS. This speaker will always be male and 12 in this corpus, so why is it necessary to repeat this on every utterance? The answer is, it is not.

The IDLINK datatype allows us to model this kind of indirection. Instead of marking speaker features on utterances, we can have a separate table for speakers...

ID    age   sex
===============
A001  12    m
A002  65    f

.... which is then referred to by the IDLINK attribute.

<u who="A001">Hello!</u>

So, instead of the data chain going Utterance -> sex , there is another layer of indirection: Utterance -> speaker -> sex.
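
To make that chain concrete, here is a minimal Python sketch. It is only an illustration of the lookup, not how CQPweb stores or processes anything; the names `speakers`, `utterances` and `speaker_sex` are made up for the example.

```python
# Hypothetical in-memory copy of the speaker table shown above.
speakers = {
    "A001": {"age": 12, "sex": "m"},
    "A002": {"age": 65, "sex": "f"},
}

# Each utterance carries only the IDLINK value (its "who" attribute).
utterances = [
    {"who": "A001", "text": "Hello!"},
]


def speaker_sex(utterance):
    """Follow the chain utterance -> speaker -> sex."""
    return speakers[utterance["who"]]["sex"]


print(speaker_sex(utterances[0]))
```

The utterance itself never stores age or sex; it stores one ID, and everything else is looked up from the speaker's row.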

It's called an IDLINK because once we declare the datatype of s-attribute u_who to be IDLINK, we promise CQPweb that an IDLINK metadata table, with all the right IDs listed, will be available. That is, the content of the IDLINK (u_who) always LINKS to an ID that exists elsewhere (in the Speaker metadata table, which is therefore an IDLINK metadata table).
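
If you want to check that promise yourself before installing the table, something like the following would do. This is a sketch under assumptions: `missing_ids` is a hypothetical helper, and you would have to extract the u_who values and the table IDs from your own files first.

```python
def missing_ids(idlink_values, table_ids):
    """Return IDLINK values that have no row in the metadata table."""
    return sorted(set(idlink_values) - set(table_ids))


# Hypothetical inputs: IDs listed in the speaker table, and the
# u_who values found on utterances in the corpus.
table_ids = ["A001", "A002"]

print(missing_ids(["A001", "A001", "A002"], table_ids))  # every link resolves
print(missing_ids(["A001", "A003"], table_ids))          # A003 has no row
```

An empty result means every value used in the corpus links to an ID in the table.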

All this is then opaque to the general user, who can specify "find instances of word X where the speaker is male" in a restricted query, for instance. CQPweb will then

- use the IDLINK table to look up the IDs of the speakers where sex = m
- use the CWB index to find the list of regions in the corpus where u_who is equal to one or other of those IDs (IE utterances by one of those speakers)
- search within only those regions of the corpus for word X
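
The three steps above can be sketched in plain Python as follows. This is a toy model under stated assumptions, not CQPweb's actual code: the token list, the `(start, end, who)` region tuples and the function `restricted_search` are all invented for the illustration.

```python
# Hypothetical data: a speaker table, a tiny corpus, and utterance
# regions given as (start, end, who) over corpus positions.
speakers = {"A001": {"sex": "m"}, "A002": {"sex": "f"}}
tokens = ["hello", "there", "hello", "again", "well", "hello"]
regions = [(0, 1, "A001"), (2, 3, "A002"), (4, 5, "A001")]


def restricted_search(word, field, value):
    # Step 1: use the table to find the IDs of matching speakers.
    ids = {sid for sid, row in speakers.items() if row[field] == value}
    # Step 2: keep only the regions whose "who" is one of those IDs.
    wanted = [(start, end) for start, end, who in regions if who in ids]
    # Step 3: search for the word within those regions only.
    return [pos
            for start, end in wanted
            for pos in range(start, end + 1)
            if tokens[pos] == word]


print(restricted_search("hello", "sex", "m"))
```

The "hello" at position 2 is skipped for sex = m because it sits in an utterance by A002.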

Note this is SIMILAR to how text metadata works (if you search within genre "fiction", then  CQPweb looks up the texts where the "genre" column contains "fiction", and searches only within those texts) but not the SAME.

The key difference is that text_id is a unique identifier, ie each text ID occurs in the corpus once and exactly once. 

However, IDLINKS aren't unique. There can be many, many utterances where who="A001". A001 is unique only *in the Speaker metadata table*. This is why we talk about "u_who" as an IDLINK rather than an ID: it is not an identifier, but something that links to an identifier.

===========================

Currently, it is not possible to have IDLINKS as a datatype for text metadata.

I was uncertain about this decision, as there is a clear use case: where one author writes many texts within the corpus, it would make sense for the "author" column in the text metadata table to contain an IDLINK to a separate Author metadata table which would contain things like author sex, age, domicile etc.

I decided against this for two reasons.

First, this system for doing Restricted Queries based on things like utterances was already *very* difficult to make work. Making it possible for there to be ANOTHER layer of indirection might have driven me mad. Wibble.

Second, if you look at corpora in practice, people tend not to mind making the sex of an author, say, part of the text metadata rather than having the author-people as a separate data entity. This is the case for the written BNC for instance - in CQPweb's predecessor, BNCweb, "sex of author" is a "written restriction" (IE text metadata).  So I just went along with this way of doing it.

Now, after all that background, re Chao's question:

Since the assignments are texts, features of the students who wrote them could be included as text metadata columns.

And if one text metadata column is a unique ID for the student, that makes it easier down the line to track the progress of individual students. You can say things like "create a subcorpus of texts where student=A001 and module=101, and compare it with a subcorpus of texts where student=A001 and module=201", and do many other comparisons, if you have the right metadata for the texts.
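
As a rough sketch of what that kind of selection amounts to (plain Python, not CQPweb's interface; the rows, the column names `student` and `module`, and the helper `subcorpus` are all hypothetical):

```python
# Hypothetical text metadata rows: one row per text.
texts = [
    {"text_id": "T001", "student": "A001", "module": "101"},
    {"text_id": "T002", "student": "A001", "module": "201"},
    {"text_id": "T003", "student": "A002", "module": "101"},
]


def subcorpus(rows, **conditions):
    """Return the text_ids of rows matching every column=value condition."""
    return [row["text_id"]
            for row in rows
            if all(row[col] == val for col, val in conditions.items())]


first = subcorpus(texts, student="A001", module="101")
later = subcorpus(texts, student="A001", module="201")
print(first, later)
```

Because the student ID is an ordinary text metadata column, no extra layer of indirection is needed to pick out one student's texts.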

This is what Jiayue recommended, but hopefully the reasons why what you want here is text metadata rather than an idlink now make sense.

As for the actual format of the input files for metadata: my 2012 paper has an example in it,

http://www.ingentaconnect.com/content/jbp/ijcl/2012/00000017/00000003/art00004 

(also available at http://www.lancs.ac.uk/staff/hardiea/cqpweb-paper.pdf if you don't have access via the link above)

Table 2 == metadata table for texts in BE06.

The file format is described in section 7.7 of the CQPweb admin manual.

(The 2012 paper doesn’t cover idlinks, because I only came up with the system last year)

Hope this helps.

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jiayue Wang
Sent: 10 January 2017 07:38
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Example of metadata file?

Hi

My solution would be to use a metadata file (ascii text file, tab 
separated values) like this:

A201	A	201
B201	B	201
C201	C	201
D201	D	201

The first column contains the text_ids; the other columns are used for
"text categorisation". In this way the four texts are clearly linked to
the same student.
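
For illustration, here is a small Python sketch that reads those four rows and groups the texts by student. The column names `text_id`, `semester` and `student` are assumptions made for the example; the file itself has no header row.

```python
import csv
import io

# The four rows from the example above: tab-separated, no header row.
raw = "A201\tA\t201\nB201\tB\t201\nC201\tC\t201\nD201\tD\t201\n"

# Column names are assumed for this sketch; they are not in the file.
columns = ["text_id", "semester", "student"]
rows = [dict(zip(columns, fields))
        for fields in csv.reader(io.StringIO(raw), delimiter="\t")]

# Group text_ids by the student column.
by_student = {}
for row in rows:
    by_student.setdefault(row["student"], []).append(row["text_id"])

print(by_student)
```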

Jiayue


On 10/01/17 03:33, Chao Sun wrote:
> Hello all,
>
> First time poster and also want to try if my subscription works. I do
> not have any linguistic background, so please be gentle if I am asking
> silly questions.
>
> I am wondering if any one could provide a comprehensive metadata file
> example with some brief explanation on how CQPWeb can utilise the
> information? I am particularly interest in the LinkID part and assuming
> this could be used for threading different articles in a corpus?
>
> In my example, handed in assignments from various semesters are compiled
> as a corpus, each assignment is a text file with a unique text_id. Is it
> possible to give each assignment various linkIDs to show how the student
> progress through all semesters? For instance, student 201 has four
> assignments in semester A, B, C, D. If I associate four columns of
> linkID (A201, B201, C201, D201) on all his four submissions, will I be
> able to analyse the progress/change in words for this individual student
> in CQPWeb?
>
> Not sure if this is how the linkID and other metadata are designed for,
> besides classification and description. Please correct me if this makes
> non-sense.
>
> Regards,
> Chao
>
> *Dr CHAO SUN *| Data Scientist
> Faculty of Arts and Social Sciences |* The University of Sydney*
> Rm N302 off Lobby J, Quadrangle A14 | The University of Sydney | NSW | 2006
> *T* +61 2 9351 4240  | *F* +61 2 9351 5333
> *E* chao.sun at sydney.edu.au
> <mailto:chao.sun at sydney.edu.au> | *W* sydney.edu.au <http://sydney.edu.au>
> CRICOS 00026A
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>