[CWB] Re: Corpus size and filtering

Fri Apr 4 12:59:47 CEST 2008

Dear Stefan,
thank you for your reply. 

Yes, that is correct: the corpus contains XML/XHTML tags which we do not want to be treated as
tokens. For example, each text file contains an index with 54 attributes (author, publication
date, type of text, etc.). If I understand your reply correctly, all of these are ignored by
CQP by default, which is great.
However, the XML/XHTML tags are necessary for users to compose their own sub-corpora. For
example, someone might want to assemble a collection of texts by a certain author and/or texts
of a certain date, and query only that collection. Hopefully this can also be done in
CWB/CQP.

Many thanks,
Rui

> Do I understand correctly that your corpus contains XML or XHTML tags  
> that you normally don't want to display to users?
> 
> The CWB doesn't treat XML tags as separate tokens (if you've declared  
> them properly), but simply remembers the start and end position of  
> each XML region. By default, the tags won't be displayed at all, but  
> you can switch on individual tags if you like, and you can use the  
> region boundaries in CQP queries.
> 
> The CQP Query Language and Corpus Encoding tutorials have some more  
> information on data formats and handling of XML tags.
