[CWB] Question about metadata

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Feb 12 15:11:43 CET 2015


Hi Josep,

You would need to use ".*racial conflicts.*" as the regex.
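
The reason is that the pattern has to cover the whole attribute value, not just part of it. A quick way to see the difference, sketched in Python rather than CQP: re.fullmatch is used here only as a stand-in for that whole-value matching (an assumption for illustration), and the variable names are made up.

```python
import re

# Value of the 'ct' attribute from the first <doc> example in the original message.
ct_value = "culture, politics, racial conflicts, US"

# Assumption: re.fullmatch imitates a regex that must match the entire value.
bare = re.fullmatch(r"racial conflicts", ct_value)
padded = re.fullmatch(r".*racial conflicts.*", ct_value)

print(bare is not None)    # the bare keyword does not cover the whole value
print(padded is not None)  # the .* on each side absorbs the other keywords
```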

Consider using | as the delimiter instead of commas. Then the special operator "contains" for set values can be used, allowing simpler patterns (behind the scenes, this just adds .* at the start and end). For example, storing the value as "|politics|UK|society|" allows 'contains "politics"' to be used as the query.
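
A minimal sketch of that conversion, in Python, for whoever prepares the input files. The helper names to_feature_set and contains are hypothetical, and contains only imitates the operator with an ordinary regex, assuming the | delimiters are part of what gets matched; it is not CQP's own implementation.

```python
import re

def to_feature_set(keywords: str) -> str:
    """Turn 'a, b, c' into '|a|b|c|' (hypothetical helper)."""
    items = [k.strip() for k in keywords.split(",") if k.strip()]
    # Assumption: a lone "|" is used to represent an empty set.
    return "|" + "|".join(items) + "|" if items else "|"

def contains(feature_set: str, keyword: str) -> bool:
    """Imitate 'contains': keyword must be a complete |-delimited item."""
    pattern = r".*\|" + re.escape(keyword) + r"\|.*"
    return re.fullmatch(pattern, feature_set) is not None

ct = to_feature_set("politics, UK, society")
print(ct)
print(contains(ct, "politics"))
print(contains(ct, "polit"))
```

The delimiters are what rule out partial hits: a plain ".*politics.*" pattern on a comma-separated value would also match a keyword such as "geopolitics".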

Cf. section 5.6 of http://cwb.sourceforge.net/files/CQP_Tutorial.pdf (most of the examples are for p-attributes, but some are for valued s-attributes).

V4 will (eventually) have set attributes as a distinct datatype that stores and indexes each value in a set separately, rather than applying a regex hack to a regular single-string value...

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
Sent: 12 February 2015 13:14
To: Open source development of the Corpus WorkBench
Subject: [CWB] Question about metadata

Hi,

We are trying to make it easy for users to add new texts to a corpus of texts used in language courses. One of the things that would make the corpus more useful would be to be able to have keywords related to content type that could be used to select texts to do a search or to do any of the other operations that are possible with specific metadata fields (e.g. frequencies of a certain expression in texts of type X vs. texts of type Y).

The problem is that it is a bit hard to classify a text with a single label and therefore restricting a particular field to only one category is rather limiting. What would be ideal would be to have fields where the person introducing the text would be able to add different keywords separated by commas as in the field 'ct' (for Content Type) below:


<doc title="Nice Title" id="C-03" century="20" ch="2'-1" ct="culture, politics, racial conflicts, US" >

<doc title="Another Nice Title" id="C-04" century="20" ch="2'-1" ct="politics, UK, society" >

Would that be a problem for CQP? Could CQP make use of partial segments of the information contained in a field? For instance, if this kind of metadata were introduced in the headings, would a query like the following be possible?

::match.doc_ct="racial conflicts";


Thanks in advance.

Josep M.

