[CWB] Re: Corpus size and filtering

Fri Apr 4 14:45:14 CEST 2008

> Yes, that is correct, the corpus contains XML/XHTML tags which we do
> not want to be taken as tokens. For example, each text file contains
> an index with 54 attributes (author, publication date, type of text,
> etc.). If I understand your reply, by default all of these are
> ignored by CQP, which is great.

Yes, but you have to declare them as "structural" attributes,  
otherwise they will be inserted as separate tokens (with warning  
messages).
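
As a rough illustration of what "declaring" means here, this is a minimal
Python sketch that assembles a cwb-encode command line. It assumes the
-P and -S options of cwb-encode and the "element:0+attr+attr" form for
s-attributes with annotations; the file names, paths and attribute names
are hypothetical, so please check the flags against your CWB version.

```python
import shlex


def build_encode_command(vrt_file, data_dir, registry_file, p_attrs, s_attrs):
    """Assemble an argument list for cwb-encode (sketch, not authoritative).

    s_attrs maps an XML element name to the XML attributes that should be
    kept as annotations, e.g. {"text": ["author", "date"]}.
    """
    # -x asks cwb-encode to treat the input as XML-aware text
    cmd = ["cwb-encode", "-x", "-d", data_dir, "-f", vrt_file, "-R", registry_file]
    for name in p_attrs:
        # additional token-level (positional) attributes
        cmd += ["-P", name]
    for element, annotations in s_attrs.items():
        # "element:0+a+b": region without nesting, plus its annotations
        spec = element + ":0" + "".join("+" + a for a in annotations)
        cmd += ["-S", spec]
    return cmd


# Hypothetical corpus layout, for illustration only
command = build_encode_command(
    vrt_file="mycorpus.vrt",
    data_dir="/corpora/data/mycorpus",
    registry_file="/corpora/registry/mycorpus",
    p_attrs=["pos", "lemma"],
    s_attrs={"text": ["author", "date", "type"], "s": []},
)
print(shlex.join(command))
```

Any XML tag that is not covered by such a declaration would end up in the
token stream, which is what produces the warnings mentioned above.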

> However, the XML/XHTML tags are necessary for the user to compose
> his/her own sub-corpus. So, someone might be interested in assembling
> a collection of texts by a certain author and/or texts of a certain
> date, and query only over this collection of texts. Hopefully this
> can also be done in CWB/CQP.

You can directly restrict searches in CQP if you make sure that the  
data are suitably encoded. In particular, the XML regions have to  
span the entire text they apply to -- you can't just have a header at  
the beginning of the file as in the TEI/XCES standard (the reason  
being that CQP is not an XML query tool but a corpus search engine,  
which deals with tokens and text regions only).
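
To make the "span the entire text" requirement concrete, here is a small
Python sketch that takes header-style metadata and emits a one-token-per-line
file in which a <text> region wraps all tokens of the document. The element
name "text", the metadata keys and the sample tokens are assumptions chosen
for illustration, not something your data has to use.

```python
from xml.sax.saxutils import quoteattr


def wrap_text(token_lines, metadata):
    """Return vertical-format text with one <text> region around all tokens.

    token_lines: one string per token (tab-separated columns if annotated).
    metadata: dict of values that would otherwise sit in a file header.
    """
    # quoteattr() escapes the value and adds the surrounding quotes
    attrs = " ".join(f"{key}={quoteattr(value)}" for key, value in metadata.items())
    lines = [f"<text {attrs}>"]
    lines.extend(token_lines)
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Hypothetical example document
tokens = ["It\tPP", "is\tVBZ", "a\tDT", "truth\tNN"]
vertical = wrap_text(tokens, {"author": "Austen", "date": "1813"})
print(vertical)
```

Because the region starts before the first token and ends after the last
one, every token of the document carries the author and date information.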

If you have very complex restrictions, or if your users will typically
search on a relatively small subset of the full corpus, a better solution is
to store the metadata separately in a relational database and let  
your user interface handle the integration.  This approach has been  
used very successfully by BNCweb, which constructs a SQL query from  
the metadata restrictions, retrieves a list of matching texts from  
its MySQL database, and then runs the CQP query on the corresponding  
subcorpus. There are some technical niceties to make this work  
properly, but we can help you with it if you get to that point.
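
The general idea can be sketched in Python as follows. This is not BNCweb's
actual code: it uses an in-memory SQLite database in place of MySQL, and the
table layout (one row per text with its first and last corpus position) and
all names are assumptions made for the example. The output is a table in the
format read by CQP's "undump" command: the number of regions on the first
line, then one start/end pair of corpus positions per line.

```python
import sqlite3


def matching_regions(conn, author=None, year_from=None, year_to=None):
    """Build a SQL query from the metadata restrictions and return the
    corpus position ranges (start, end) of all matching texts."""
    clauses, params = [], []
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if year_from is not None:
        clauses.append("year >= ?")
        params.append(year_from)
    if year_to is not None:
        clauses.append("year <= ?")
        params.append(year_to)
    sql = "SELECT cpos_start, cpos_end FROM texts"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY cpos_start"  # keep the regions in corpus order
    return conn.execute(sql, params).fetchall()


def undump_table(regions):
    """Format regions as an undump table: row count, then start<TAB>end."""
    lines = [str(len(regions))]
    lines += [f"{start}\t{end}" for start, end in regions]
    return "\n".join(lines) + "\n"


# Hypothetical metadata table with three texts
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts (text_id TEXT PRIMARY KEY, author TEXT, "
    "year INTEGER, cpos_start INTEGER, cpos_end INTEGER)"
)
conn.executemany(
    "INSERT INTO texts VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "Austen", 1813, 0, 999),
        ("t2", "Dickens", 1850, 1000, 2499),
        ("t3", "Austen", 1815, 2500, 3199),
    ],
)
regions = matching_regions(conn, author="Austen")
table = undump_table(regions)
print(table)
```

The user interface would write this table to a file, load it into CQP with
undump, activate the resulting named query as a subcorpus, and then run the
user's query on it.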

Best,
Stefan