[CWB] TT-CWB-ENCODE

Thu Dec 10 13:18:44 CET 2015

For people wanting to work with CQP and XML files - the public part of TEITOK does just that: it lets you define what in your XML files counts as a token, compile a CQP corpus directly from the XML files, and then run a CQP query to extract portions of the XML files. The public part is mostly intended for command-line use; for embedding in a full XML-based web environment supporting CQP searches, see http://teitok.corpuswiki.org, which is currently a private repository for security reasons.

The package consists of two parts:

* tt-cwb-encode is an XML alternative to cwb-encode that takes XML files rather than VRT files as input. Apart from the regular binary CQP files, it writes two additional types of files. The first is XX_xidx.rng, for pattributes and all sattributes; these files have the same structure as other .rng files, but point to byte offsets in the XML file for each corpus position and structure. The second is text_id.idx, a file relating each corpus position to a <text> range in text_id.avx, where text_id is the name of the XML file.

* tt-cwb-xidx allows extracting parts of the XML files based on CQP corpus positions. For any CQP corpus that contains the two additional types of files mentioned above, tt-cwb-xidx takes corpus positions as input and outputs the corresponding XML fragment.
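
To make the byte-offset idea concrete, here is a minimal Python sketch. It is only an illustration of the principle, not the actual on-disk format or the tt-cwb-xidx implementation: the index is a plain in-memory list, the tokens are found with a simple regular expression rather than an XML parser, and the function names and the id attributes in the sample are made up for the example.

```python
import re

def build_offset_index(xml_bytes, tag=b"tok"):
    """Return one (start, end) byte-offset pair per token element.

    The list position plays the role of the corpus position.
    (Hypothetical helper; a regex stands in for a real XML parser.)
    """
    pattern = re.compile(
        b"<" + re.escape(tag) + rb"\b[^>]*>.*?</" + re.escape(tag) + b">",
        re.DOTALL,
    )
    return [(m.start(), m.end()) for m in pattern.finditer(xml_bytes)]

def extract_fragment(xml_bytes, index, first, last):
    """Return the raw XML from the start of token `first` to the end of token `last`."""
    start = index[first][0]
    end = index[last][1]
    return xml_bytes[start:end]

# Tiny sample document; offsets are counted in bytes, not characters.
xml = (
    '<text><s><tok id="w-1">La</tok> <tok id="w-2">cigale</tok> '
    '<tok id="w-3">ayant</tok> <tok id="w-4">chanté</tok></s></text>'
).encode("utf-8")

index = build_offset_index(xml)
fragment = extract_fragment(xml, index, 1, 2)
print(fragment.decode("utf-8"))
```

Because the lookup is a direct slice between two stored offsets, the cost of extracting a fragment does not depend on where in the file the tokens sit.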

The repository contains two tiny corpora as a demonstration (one in TEITOK format, the other in pure TEI format), and after compiling the Fontaine corpus (The Ant and the Grasshopper from the original French publication), you can run the following from the command line:
echo 'A = [word="a.*"]; tabulate A 0 100 match, matchend; ' |  cqp -i -D TT-FONTAINE | tt-cwb-xidx
This will run a CQP query that searches for all words starting with an a and feed the corpus positions of the matches to tt-cwb-xidx, which will output the full XML of the <tok> elements in the original file for the first 100 matches. CQP queries can use full CQL, and both CQP and tt-cwb-xidx can expand the context (tt-cwb-xidx will automatically restrict results to the same XML file).
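
For scripted use, the same pipeline could be wrapped along the following lines. This is a sketch under assumptions: the function names are made up, the cqp and tt-cwb-xidx invocations simply mirror the command line above, and parse_positions assumes that tabulate prints one match per line with tab-separated columns.

```python
import subprocess

def build_query(regex, first, last, name="A"):
    """Build the CQP command string used in the shell example above."""
    return f'{name} = [word="{regex}"]; tabulate {name} {first} {last} match, matchend;'

def parse_positions(tabulate_output):
    """Parse tabulate output into (match, matchend) integer pairs.

    Assumes one match per line with two tab-separated columns;
    lines that do not fit that shape are skipped.
    """
    pairs = []
    for line in tabulate_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        try:
            pairs.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue
    return pairs

def run_query(corpus, query):
    """Pipe a query through cqp and hand the positions to tt-cwb-xidx.

    Both programs must be on the PATH; the arguments mirror the
    shell command shown above.
    """
    cqp = subprocess.run(
        ["cqp", "-i", "-D", corpus],
        input=query + "\n", capture_output=True, text=True, check=True,
    )
    xidx = subprocess.run(
        ["tt-cwb-xidx"],
        input=cqp.stdout, capture_output=True, text=True, check=True,
    )
    return xidx.stdout
```

With the Fontaine corpus compiled, run_query("TT-FONTAINE", build_query("a.*", 0, 100)) should then correspond to the shell command above.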

Thanks to the index files, tt-cwb-xidx stays fast for corpora with either large XML files (tested on a corpus with texts of over 35k tokens or over 20 MB, but there is no real size limitation) or a large number of different XML files (tested on a corpus with over 5k different files). Since tt-cwb-encode has to parse XML files, it is somewhat slower than cwb-encode, but thanks to the lightweight pugixml library it is still sufficiently fast for most purposes.

The repository can be found here:

https://gitlab.com/maartenes/TT-CWB
