[CWB] Question about metadata

Josep M. josepm.fontana at upf.edu
Thu Feb 12 17:15:55 CET 2015


Great. Thank you! This is really not a bad solution.

JM
> Hi Josep,
>
> You would need to use ".*racial conflicts.*" as the regex.
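A minimal sketch of why the bare string fails and the padded regex works, assuming (as in CQP) that a pattern has to match the whole attribute value. Python's `re` module stands in for CQP's regex engine here, and `cqp_style_match` is a hypothetical helper used for illustration only, not part of CWB.

```python
import re

def cqp_style_match(pattern: str, value: str) -> bool:
    """Hypothetical helper: the pattern must match the entire value,
    which is how CQP applies a regex to an attribute value."""
    return re.fullmatch(pattern, value) is not None

# The ct value from the first <doc> example in the original question.
ct = "culture, politics, racial conflicts, US"

print(cqp_style_match("racial conflicts", ct))      # False: pattern does not cover the whole value
print(cqp_style_match(".*racial conflicts.*", ct))  # True: .* absorbs the other keywords
```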
>
> Consider using | as the delimiter. Then the special operator "contains" for set values can be used, allowing simpler patterns (behind the scenes, this just adds .* at the start and end). For example, storing the value as "|politics|UK|society|" allows 'contains "politics"' to be used as the query.
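The delimiter change could be applied when the texts are prepared for encoding. The sketch below converts a comma-separated `ct` value into the |-delimited form and roughly emulates what a `contains` test checks. Both function names are hypothetical, and the emulation is a plain substring test in Python, not CQP's actual implementation.

```python
def to_feature_set(ct: str) -> str:
    """Turn 'a, b, c' into '|a|b|c|'. An empty input yields '|'."""
    values = [v.strip() for v in ct.split(",") if v.strip()]
    return ("|" + "|".join(values) + "|") if values else "|"

def contains(feature_set: str, value: str) -> bool:
    """Rough emulation of a 'contains' test: the value must appear
    as a complete element between two | delimiters."""
    return f"|{value}|" in feature_set

fs = to_feature_set("culture, politics, racial conflicts, US")
print(fs)                                # |culture|politics|racial conflicts|US|
print(contains(fs, "racial conflicts"))  # True
print(contains(fs, "racial"))            # False: only whole elements match
```

One advantage of the delimited form is visible in the last line: a partial keyword such as "racial" does not match, whereas the pattern ".*racial.*" on the comma-separated value would.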
>
> Cf. sec 5.6 http://cwb.sourceforge.net/files/CQP_Tutorial.pdf  (most of the examples are for p-attributes, but some are for valued s-attributes)
>
> V4 will (eventually) have set attributes as a distinct datatype that store and index each value in a set separately, rather than applying a regex hack to a regular single-string value...
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> Sent: 12 February 2015 13:14
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] Question about metadata
>
> Hi,
>
> We are trying to make it easy for users to add new texts to a corpus of texts used in language courses. One of the things that would make the corpus more useful would be to be able to have keywords related to content type that could be used to select texts to do a search or to do any of the other operations that are possible with specific metadata fields (e.g. frequencies of a certain expression in texts of type X vs. texts of type Y).
>
> The problem is that it is a bit hard to classify a text with a single label and therefore restricting a particular field to only one category is rather limiting. What would be ideal would be to have fields where the person introducing the text would be able to add different keywords separated by commas as in the field 'ct' (for Content Type) below:
>
>
> <doc title="Nice Title" id="C-03" century="20" ch="2'-1" ct="culture, politics, racial conflicts, US" >
>
> <doc title="Another Nice Title" id="C-04" century="20" ch="2'-1" ct="politics, UK, society" >
>
> Would that be a problem for CQP? Could CQP make use of partial segments of the information contained in a field? For instance, if this kind of metadata were included in the headers, would a query like the following be possible?
>
> ::match.doc_ct="racial conflicts";
>
>
> Thanks in advance.
>
> Josep M.
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


