[CWB] TEITOK

Maarten Janssen maartenpt at gmail.com
Tue Nov 17 13:41:46 CET 2015


TEITOK is an online platform for searching through texts that uses CQP as the backbone for its search queries, not unlike CQPweb and a range of similar implementations. What makes TEITOK fundamentally different is that it uses CQP only as a search index, whereas the primary data are stored in full-fledged XML files, (ideally) in TEI format. TEITOK is primarily meant for corpora with heavy "non-linguistic" mark-up, such as typographic information, deletions, changes of hand, and even aligned sound files, facsimile images, etc. This leads to corpora that are of interest not only for linguistic research, but also for other purposes, such as historical research (in the case of historical corpora), teaching, or simply (in the case of less-resourced languages) making language material accessible.

In TEITOK, token information is added in-line to the XML file in a <tok> element. The CQP corpus is built periodically from the XML files, and the XML files can contain a lot of data that is not exported to the CQP corpus. Every CQP token is co-indexed with a token in one of the XML files, so any result of a CQL query links directly to a word in an XML file, where all the additional information can be visualized (in the browser).
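To make the co-indexing concrete, here is a minimal sketch in Python (not the actual TEITOK code). It assumes that every <tok> carries a unique `id` attribute and that the index simply stores that id for each corpus position. The sample XML, the attribute names, and the functions `build_token_index` and `lookup` are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real file would be a full TEI document.
SAMPLE = """<TEI><text>
<p><tok id="w-1" lemma="the">The</tok> <del><tok id="w-2" lemma="cat">catt</tok></del>
<tok id="w-3" lemma="cat">cat</tok></p>
</text></TEI>"""

def build_token_index(xml_string):
    """Return a list whose i-th entry is the id of the <tok> at corpus position i."""
    root = ET.fromstring(xml_string)
    # iter() walks the tree in document order, so list position == corpus position.
    return [tok.get("id") for tok in root.iter("tok")]

def lookup(xml_string, index, cpos):
    """Map a corpus position from a query hit back to its <tok> element."""
    root = ET.fromstring(xml_string)
    wanted = index[cpos]
    for tok in root.iter("tok"):
        if tok.get("id") == wanted:
            return tok
    return None

index = build_token_index(SAMPLE)
hit = lookup(SAMPLE, index, 1)
print(index, hit.text, hit.get("lemma"))
```

The second token sits inside a <del>, which is the kind of mark-up that stays in the XML and does not have to be exported to the search index.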

TEITOK is not only a visualization platform: it also allows modifying the XML files, and even makes it possible to use CQL queries to look for potential errors in the corpus and correct them directly via the online interface. TEITOK is used for a growing number of widely varying corpora, including corpora of mediaeval manuscripts, learner corpora, spoken corpora, and corpora for less-resourced languages. To be usable in this wide range of projects, TEITOK has a large number of customization options.

More information about the system can be found at http://teitok.corpuswiki.org, or just contact me directly with questions.

Until recently, TEITOK created the CQP corpus by first exporting a .vrt file, where roughly each <tok> corresponds to a line and each attribute on the <tok> to a column, and then using cwb-encode to build the actual CQP corpus. However, flattening XML files into VRT files becomes increasingly complicated as more information is to be exported: it is not trivial to insert <s> annotations when not all the tokens in the corpus are inside an <s>. And recently, I added the option in TEITOK to include stand-off annotation files, either plain range-based annotations or even PSDX-style syntactic annotations, which are hard if not impossible to export to VRT. For that reason, since this week TEITOK uses a custom C++ application to build the files needed by cwb-makeall directly from the XML files. This program, called tt-cwb-encode, is still too specific to the TEITOK environment, but should soon become available for people wanting to build a CQP corpus directly from XML files.
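For readers unfamiliar with the older route, the flattening step can be sketched roughly as follows. This is an illustration in Python, not the exporter TEITOK used. The attribute names in `COLUMNS`, the `_` placeholder for missing values, and the function `to_vrt` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the last token is deliberately outside any <s>.
SAMPLE = """<text id="doc1">
<s><tok lemma="the" pos="DET">The</tok> <tok lemma="cat" pos="N">cats</tok></s>
<tok lemma="sleep" pos="V">sleep</tok>
</text>"""

COLUMNS = ["lemma", "pos"]  # assumed <tok> attributes to export as columns

def to_vrt(xml_string, columns=COLUMNS):
    """Flatten an XML text into VRT: one line per <tok>, one column per attribute."""
    root = ET.fromstring(xml_string)
    lines = ['<text id="%s">' % root.get("id")]

    def walk(node):
        for child in node:
            if child.tag == "tok":
                word = "".join(child.itertext())
                cells = [word] + [child.get(c, "_") for c in columns]
                lines.append("\t".join(cells))
            elif child.tag == "s":
                lines.append("<s>")
                walk(child)
                lines.append("</s>")
            else:
                walk(child)  # other mark-up is dropped, its tokens are kept

    walk(root)
    lines.append("</text>")
    return "\n".join(lines)

print(to_vrt(SAMPLE))
```

Even in this toy case the third token ends up outside any <s> region, and everything that is not a <tok> or an <s> is silently lost, which hints at why the approach does not scale to richer annotation.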

Now while building tt-cwb-encode, I ran into some questions which hopefully someone here might be able to respond to:

- a structural attribute like text_id has a .rng file that is always identical to the text.rng, why are the files necessary?

- the technical manual quite explicitly states that structures cannot embed or overlap; however, the logic of .rng files does not seem to rule that out in any way. Is there something internal in CQL that makes this impossible, or is it just a side-effect of using non-XML input from .vrt files, and can tt-cwb-encode create embedded and overlapping ranges without breaking CQL?

- the news that CWB 4 might use a completely different architecture (Zyggo) is somewhat disconcerting, since it might break a lot of things. When is this change planned for (roughly), and how backwards compatible is it intended to be?

- ideally, the CQP tokens would directly point to indexes in the XML files, to make it possible to efficiently extract the matching data directly from the XML files. An inelegant method would be to add two p-attributes for this, but would there be a more elegant way to link tokens in CQP to ranges in external files? Or would it just be better to keep such lookup files outside of CQP?
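To make the questions about .rng files more concrete, here is a small Python sketch for inspecting the ranges of a structural attribute. It assumes that a .rng file is a flat sequence of (start, end) corpus-position pairs stored as 32-bit big-endian integers; that assumption should be checked against the technical manual. The functions `read_rng` and `classify` are invented for the illustration, and the sketch only reports embedded or overlapping ranges; it says nothing about how CQP itself would handle them.

```python
import struct

def read_rng(data):
    """Decode the bytes of a .rng file into a list of (start, end) pairs.

    Assumes pairs of 32-bit big-endian integers (an assumption, see above).
    """
    count = len(data) // 8
    values = struct.unpack(">%di" % (2 * count), data[: 8 * count])
    return list(zip(values[0::2], values[1::2]))

def classify(ranges):
    """Report pairs of regions that are nested or that partially overlap."""
    problems = []
    # Sort by start ascending, end descending, so an outer region precedes inner ones.
    ordered = sorted(ranges, key=lambda r: (r[0], -r[1]))
    for i, (s1, e1) in enumerate(ordered):
        for s2, e2 in ordered[i + 1:]:
            if s2 > e1:
                break  # all later regions start even further to the right
            kind = "nested" if e2 <= e1 else "overlap"
            problems.append((kind, (s1, e1), (s2, e2)))
    return problems

# Three made-up regions: (2,4) lies inside (0,9), and (8,12) straddles its end.
data = struct.pack(">6i", 0, 9, 2, 4, 8, 12)
ranges = read_rng(data)
report = classify(ranges)
print(ranges)
print(report)
```

Running something like this on the text_id.rng and text.rng of an existing corpus would also be a quick way to confirm that the two files really are identical.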



