[CWB] TEI header tags in CWB queries

Stefan Evert stefan.evert at uos.de
Sun Oct 15 22:24:30 CEST 2006


On 15 Oct 2006, at 22:07, lars nygaard wrote:

> Stefania,
>
> The best way is to store the metadata (such as gender) for each  
> speaker in a database, along with start and stop corpus positions.  
> Then you can create a subcorpus file with corpus positions and  
> import it into CQP with the UNDUMP command. Please refer to the CQP  
> Query Tutorial for more information. If you do not already have  
> tools to do this, I will be happy to provide you with the necessary  
> scripts.
>

Or, indeed, you have to copy the speaker information onto each <u>  
region. (That's not much worse, since you have to store the speaker  
IDs and some other information in those tags anyway. TEI is right  
not to duplicate information that you may want to maintain and edit,  
but when it's just for query purposes, putting the information where  
you look for it makes a lot of things easier.)
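
For illustration, here is a minimal Python sketch of that copying  
step. It assumes the corpus is already in a one-token-per-line format  
with each <u who="..."> start tag on its own line, and that the  
speaker metadata has been pulled out of the TEI header into a  
dictionary. The speaker IDs, the attribute names (sex, age) and the  
function name are made up for the example.

```python
import re
from xml.sax.saxutils import quoteattr

# Hypothetical speaker table, as it might be extracted from the TEI header.
SPEAKERS = {
    "SP01": {"sex": "f", "age": "34"},
    "SP02": {"sex": "m", "age": "51"},
}

# Matches a <u> start tag that carries only a who="..." attribute.
U_TAG = re.compile(r'<u\s+who="([^"]+)"\s*>')


def annotate_u_tags(lines, speakers):
    """Copy each speaker's metadata onto the <u> start tags that cite them."""
    for line in lines:
        line = line.rstrip("\n")
        match = U_TAG.fullmatch(line.strip())
        if match is None:
            # Tokens, end tags and other markup pass through unchanged.
            yield line
            continue
        who = match.group(1)
        # Unknown speakers keep just their who attribute.
        attrs = {"who": who, **speakers.get(who, {})}
        rendered = " ".join(f"{k}={quoteattr(v)}" for k, v in attrs.items())
        yield f"<u {rendered}>"


if __name__ == "__main__":
    sample = ['<u who="SP01">', "hello", "there", "</u>"]
    print("\n".join(annotate_u_tags(sample, SPEAKERS)))
```

The extra attributes would then have to be declared when encoding  
(for cwb-encode, with an -S declaration along the lines of  
u:0+who+sex+age), so that they can be used as query restrictions.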

The new version of BNCweb takes both approaches, by the way.  When  
encoding the BNC for use with CQP, speaker information is attached  
directly to the <u> regions.  In a second step, the speaker information  
table is read into a relational database, where most metadata  
restrictions are computed.  You can then either use a list of speaker  
IDs obtained from the database and insert it into your CQP query  
(provided that list doesn't get too long), or construct a subcorpus  
containing the relevant <u> regions (from an additional table in the  
database) and undump it into CQP, as Lars suggested.  BNCweb does the  
latter in order to achieve better performance and stability for very  
complex metadata restrictions.
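
To make the two options concrete, here is a rough Python sketch using  
an in-memory SQLite database. This is not the BNCweb code: the table  
and column names, the speaker IDs and the corpus positions are all  
invented for the example. The only thing taken from CQP is the undump  
table layout as I understand it from the CQP tutorial: a first line  
giving the number of rows, then one "start<TAB>end" pair of corpus  
positions per line.

```python
import sqlite3


def build_demo_db():
    """Create a toy speaker table and a table of <u> regions (all hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE speaker (id TEXT PRIMARY KEY, sex TEXT);
        CREATE TABLE utterance (
            speaker_id TEXT, cpos_start INTEGER, cpos_end INTEGER
        );
        INSERT INTO speaker VALUES ('SP01', 'f'), ('SP02', 'm'), ('SP03', 'f');
        INSERT INTO utterance VALUES
            ('SP01', 0, 11), ('SP02', 12, 30), ('SP03', 31, 45), ('SP01', 46, 60);
    """)
    return db


def speaker_id_alternation(db, sex):
    """Option 1: a regex alternation of speaker IDs to paste into a query."""
    rows = db.execute(
        "SELECT id FROM speaker WHERE sex = ? ORDER BY id", (sex,)
    )
    # Assumes IDs contain no regex metacharacters.
    return "|".join(row[0] for row in rows)


def undump_table(db, sex):
    """Option 2: an undump table of the <u> regions of matching speakers."""
    rows = db.execute(
        """
        SELECT u.cpos_start, u.cpos_end
        FROM utterance AS u JOIN speaker AS s ON s.id = u.speaker_id
        WHERE s.sex = ?
        ORDER BY u.cpos_start
        """,
        (sex,),
    ).fetchall()
    # First line: number of rows; then one start/end pair per line.
    lines = [str(len(rows))] + [f"{start}\t{end}" for start, end in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    db = build_demo_db()
    print(speaker_id_alternation(db, "f"))
    print(undump_table(db, "f"), end="")
```

The output of undump_table() would be written to a file and loaded in  
CQP with something like: undump Female < "female.tbl";  
The alternation from the first function is only practical while the  
list of IDs stays short, which is the caveat mentioned above.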

Hope this helps a little,
Stefan

PS: This sounds like it would be a good idea to write generic scripts  
for encoding TEI corpora in the CWB and some tools for querying  
speaker information with the help of a relational database, doesn't  
it? Any volunteers? :o) I guess that part of the BNCweb code could be  
reused, but it's still unfinished and probably too specific for a  
generic solution.




