[CWB] TEI header tags in CWB queries
Stefan Evert
stefan.evert at uos.de
Sun Oct 15 22:24:30 CEST 2006
On 15 Oct 2006, at 22:07, lars nygaard wrote:
> Stefania,
>
> The best way is to store the metadata (such as gender) for each
> speaker in a database, along with start and stop corpus positions.
> Then you can create a subcorpus file with corpus positions and
> import it into CQP with the UNDUMP command. Please refer to the
> CQP Query Tutorial for more information. If you do not already have
> tools to do this, I will be happy to provide you with the necessary
> scripts.
>
Or you can copy the speaker information onto each <u> region instead.
(That's not much worse, since you have to store the speaker IDs and
some other information in those tags anyway. TEI is right not to
duplicate information that you may want to maintain and edit, but
when it's just for query purposes, putting the information where you
look for it makes a lot of things easier.)
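To illustrate what I mean by breaking the speaker information down to
each <u> region, here is a minimal Python sketch of a preprocessing
step that could run before encoding. It is only an assumption about
how one might do it: the function name, the attribute names (sex,
age) and the shape of the speaker table are all made up for the
example, and it expects every <u who="..."> start tag on a line of
its own.

```python
import re
from xml.sax.saxutils import quoteattr

# Matches a start tag such as <u who="PS01"> standing alone on a line.
U_TAG = re.compile(r'<u\s+who="([^"]*)"\s*>')


def attach_speaker_info(line, speakers):
    """Copy per-speaker metadata onto a <u> start tag.

    `line` is one input line without its trailing newline.
    `speakers` is a hypothetical table mapping a speaker ID to a dict
    of attribute values, e.g. {"PS01": {"sex": "f", "age": "34"}}.
    Lines that are not <u who="..."> tags are returned unchanged.
    """
    m = U_TAG.fullmatch(line.strip())
    if m is None:
        return line
    who = m.group(1)
    attrs = speakers.get(who, {})
    # Sort the attributes so that the output is reproducible.
    extra = "".join(
        f" {name}={quoteattr(value)}" for name, value in sorted(attrs.items())
    )
    return f'<u who="{who}"{extra}>'
```

The extra attributes would then have to be declared when the corpus
is encoded, so that they become available as annotations of the <u>
regions.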
The new version of BNCweb takes both approaches, by the way. When
encoding the BNC for use with CQP, speaker information is attached
directly to the <u> regions. In a second step, the speaker information
table is read into a relational database, where most metadata
restrictions are computed. You can then either use a list of speaker
IDs obtained from the database and insert it into your CQP query
(provided that list doesn't get too long), or construct a subcorpus
containing the relevant <u> regions (from an additional table in the
database) and undump it into CQP, as Lars suggested. BNCweb does the
latter in order to achieve better performance and stability for very
complex metadata restrictions.
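The database route can be sketched roughly as follows. This is not
the BNCweb code, just a minimal illustration in Python with SQLite.
The table and column names (speakers, u_regions, cpos_start,
cpos_end) are assumptions made up for the example. The output uses
the classic undump table layout: the number of rows on the first
line, then one "start TAB end" pair of corpus positions per line.

```python
import sqlite3


def regions_for_sex(conn, sex):
    """Return (start, end) corpus positions of all <u> regions whose
    speaker has the given sex, sorted by start position.

    Assumes two hypothetical tables:
      speakers(id, sex)
      u_regions(speaker_id, cpos_start, cpos_end)
    """
    cursor = conn.execute(
        "SELECT r.cpos_start, r.cpos_end "
        "FROM u_regions AS r JOIN speakers AS s ON s.id = r.speaker_id "
        "WHERE s.sex = ? ORDER BY r.cpos_start",
        (sex,),
    )
    return cursor.fetchall()


def format_undump(regions):
    """Format regions as an undump table: row count first, then one
    tab-separated start/end pair per line."""
    lines = [str(len(regions))]
    lines.extend(f"{start}\t{end}" for start, end in regions)
    return "\n".join(lines) + "\n"
```

Writing the result of format_undump() to a file, say females.tbl,
should then allow something like

    undump Females < "females.tbl";

in CQP, after which queries can be run on the subcorpus Females.
Please check the CQP Query Tutorial for the exact details of the
table format in your version.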
Hope this helps a little,
Stefan
PS: This sounds like it would be a good idea to write generic scripts
for encoding TEI corpora in the CWB and some tools for querying
speaker information with the help of a relational database, doesn't
it? Any volunteers? :o) I guess that part of the BNCweb code could be
reused, but it's still unfinished and probably too specific for a
generic solution.