[CWB] TT-CWB-ENCODE

Thu Dec 10 15:53:32 CET 2015

By the way, https://metacpan.org/pod/XML::TMX::CWB might be useful for
people interested in encoding TMX files.

best,
ambs

On Thu, Dec 10, 2015 at 12:18 PM, Maarten Janssen <maartenpt at gmail.com>
wrote:

> For people wanting to work with CQP and XML files - the public part of
> TEITOK does just that: define what in your XML files counts as a token,
> compile a CQP corpus directly from the XML files, and then run a CQP query
> to extract portions of the XML files. The public part is mostly usable for
> command-line use; for embedding in a full XML-based web environment
> supporting CQP searches, see http://teitok.corpuswiki.org which is currently
> a private repository for security reasons.
>
> The package consists of two parts:
>
> * *tt-cwb-encode* is an XML alternative to cwb-encode and takes XML files
> rather than VRT files as input. Apart from the regular binary CQP files, it
> writes two additional types of files: XX_xidx.rng for pattributes and all
> sattributes, which have the same structure as other .rng files, but point
> to byte offsets in the XML file for each corpus position and structure; and
> text_id.idx, which is a file relating each corpus position to a <text>
> range in the text_id.avx, where text_id is the name of the XML file.
>
> * *tt-cwb-xidx* allows extracting parts of the XML files based on CQP
> corpus positions. For any CQP corpus that contains the two additional types
> of files mentioned above, tt-cwb-xidx takes corpus positions as input, and
> outputs the corresponding XML fragment.
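
To make the byte-offset idea above concrete, here is a minimal Python sketch. It assumes the index is laid out like a standard CWB .rng file, i.e. consecutive (start, end) pairs of 32-bit big-endian integers, and that the end offset is exclusive; the actual TT-CWB layout may differ. The function names and the toy data are hypothetical, and the sketch handles a single XML file only.

```python
import struct

def read_ranges(rng_bytes):
    """Decode an .rng-style buffer into a list of (start, end) pairs.

    Assumption: 8-byte records, two big-endian 32-bit ints each.
    """
    if len(rng_bytes) % 8 != 0:
        raise ValueError("expected a whole number of 8-byte records")
    return [struct.unpack_from(">ii", rng_bytes, off)
            for off in range(0, len(rng_bytes), 8)]

def xml_fragment(xml_bytes, ranges, first, last):
    """Return the XML bytes spanning corpus positions first..last (inclusive)."""
    start = ranges[first][0]
    end = ranges[last][1]  # assumed exclusive end offset
    return xml_bytes[start:end]

# Toy XML file with two tokens.
xml = b'<text><tok id="w-1">La</tok> <tok id="w-2">cigale</tok></text>'

# Build a toy index by locating each <tok> element in the byte string.
index = b""
pos = 0
while True:
    start = xml.find(b"<tok", pos)
    if start == -1:
        break
    end = xml.index(b"</tok>", start) + len(b"</tok>")
    index += struct.pack(">ii", start, end)
    pos = end

ranges = read_ranges(index)
print(xml_fragment(xml, ranges, 1, 1).decode())
```

Because the lookup is a fixed-size record read followed by a byte slice, its cost does not depend on how large the XML file is, which is presumably the point of keeping such an index.
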
>
> The repository contains two tiny corpora as a demonstration (one in TEITOK
> format, the other in pure TEI format), and after compiling the Fontaine
> corpus (The Ant and the Grasshopper from the original French publication),
> you can run the following from the command line:
>
> echo 'A = [word="a.*"]; tabulate A 0 100 match, matchend; ' |  cqp -i -D TT-FONTAINE | tt-cwb-xidx
>
> This will run a CQP query that searches for all words starting with an a
> and feed the corpus positions of the matches to tt-cwb-xidx, which will
> output the full XML of the <tok> elements in the original file for the
> first 100 matches.
> CQP queries can use full CQL, and both CQP and tt-cwb-xidx can expand the
> context (tt-cwb-xidx will automatically restrict results to the same XML
> file).
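
The same pipeline can be driven from a script. The sketch below is only an illustration: the helper names are hypothetical, the cqp flags and the argument-less tt-cwb-xidx call are copied from the command quoted above, and it assumes that tabulate prints one tab-separated "match, matchend" pair per line.

```python
import subprocess

def build_query(pattern, first=0, last=100, name="A"):
    """Build the CQP input: run the query, then tabulate match positions."""
    return f"{name} = {pattern}; tabulate {name} {first} {last} match, matchend;"

def parse_tabulate(output):
    """Turn tab-separated tabulate output into (match, matchend) int pairs.

    Lines that do not consist of exactly two integers are skipped.
    """
    pairs = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        try:
            pairs.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue
    return pairs

def run_pipeline(corpus, query):
    """Pipe the query through cqp, then its output through tt-cwb-xidx.

    Requires both tools on PATH and a compiled corpus; not run here.
    """
    cqp = subprocess.run(["cqp", "-i", "-D", corpus], input=query,
                         capture_output=True, text=True, check=True)
    xidx = subprocess.run(["tt-cwb-xidx"], input=cqp.stdout,
                          capture_output=True, text=True, check=True)
    return xidx.stdout

query = build_query('[word="a.*"]')
print(query)
```

With the Fontaine demo corpus compiled, run_pipeline("TT-FONTAINE", query) should be equivalent to the shell one-liner; parse_tabulate is only needed if you want to inspect the corpus positions yourself before handing them on.
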
>
> Thanks to the indexing files, tt-cwb-xidx stays fast for corpora with
> either large XML files (tested on a corpus with texts of over 35k tokens or
> over 20Mb, but there is no real size limitation), or with a large number of
> different XML files (tested on a corpus with over 5k different files).
> Since tt-cwb-encode has to parse XML files, it is somewhat slower than
> cwb-encode, but because it uses the lightweight pugixml library it is still
> sufficiently fast for most purposes.
>
> The repository can be found here:
>
> https://gitlab.com/maartenes/TT-CWB