[CWB] TEITOK

Maarten Janssen maartenpt at gmail.com
Thu Nov 19 15:39:40 CET 2015


Hi Stefan and Andrew,

thanks for the answers! Here are some responses:

> TEITOK looks like an excellent tool - can we put a link to the server on the CWB homepage?  

Of course you can; I would be glad if you did - the people using it seem quite pleased with it, so there is definitely a “market” for it.

> Also, having a mostly automated TEI converter program would be really useful.

TEITOK is not really a TEI converter, and depending on what you want to convert you have to follow a different path:

- The internal structure TEITOK uses is not really TEI, although it is TEI compliant; there are too many options in TEI to really work with it directly. What is specifically not used is the P4+ style of <w> elements, where annotation is modeled as text nodes under child nodes. Instead, it uses the “older” style of <w>, where annotations are attributes (to make sure they are always strings), and calls them <tok> rather than <w> to avoid confusion (and since <w> typically excludes punctuation marks, while tokens do not). So to use TEITOK, you either have to start from a TEI file that is not tokenized (TEITOK has an XML tokenizer to create TEITOK-style tokenized TEI), or convert the TEI file to TEITOK style (in Ljubljana they wrote an XSLT that does exactly that), after which tt-cwb-encode will directly create a CQP corpus for you.

- tt-cwb-encode can be used to directly convert most TEI flavours to a CQP corpus (I should provide an example settings file with it to show how to convert a typical <w> style TEI file to CQP), but tt-cwb-encode does not tokenize, so for that you would need a file that IS already tokenized (and annotated), and specify exactly which information can be found where in your TEI file. 

>>>> - the technical manual quite explicitly states that structures cannot embed or overlap; however, the logic of .rng files does not seem to invalidate that in any way.
>> 
>> *Different* attributes can embed and overlap. But instances of one attribute can't embed with, or overlap with, other instances of the same attribute. And yes, it is not the structure of the binary files but rather the way they are used that prevents that.
> 
> Well, the unpublished file format specification - which I assume you mean by the "logic of .rng files" - mandates that regions don't nest or overlap: the integer values in a .rng file must form an increasing sequence.  If you violate the file format, bad things will happen (i.e. undefined behaviour of CQP and the other CWB tools).

I have by now fully implemented it, and I can confirm that that is indeed a hard requirement: if you create two overlapping ranges, one over tokens 4-6 with error_type="agreement" and one over 5-7 with error_type="collocation" (generated, in the example I tried, from stand-off annotation files where ranges can overlap), then only token 7 will be a "collocation" error, while 4-6 are only "agreement" errors. However, at least in simple tests, it does not seem to break CWB in any way - it just ignores any token inside a range <x> that was already inside another range <x>.  
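That truncating behaviour is easy to reproduce when generating the ranges yourself. Here is a minimal sketch (my own illustration, not TEITOK or CWB code) that clips overlapping regions of one s-attribute so the resulting offsets form an increasing sequence:

```cpp
#include <algorithm>
#include <vector>

struct Range { int start, end; };  // inclusive token positions

// Illustrative helper: clip overlapping regions of the same
// s-attribute so that the start/end offsets form an increasing
// sequence, mimicking how CWB appears to resolve overlaps - a token
// already covered by an earlier region is dropped from the later one.
std::vector<Range> clip_overlaps(std::vector<Range> regions) {
    std::sort(regions.begin(), regions.end(),
              [](const Range &a, const Range &b) { return a.start < b.start; });
    std::vector<Range> out;
    int prev_end = -1;
    for (Range r : regions) {
        if (r.start <= prev_end)
            r.start = prev_end + 1;  // drop tokens already covered
        if (r.start > r.end)
            continue;                // region fully swallowed
        out.push_back(r);
        prev_end = r.end;
    }
    return out;
}
```

With the ranges from the example above, 4-6 and 5-7, this yields 4-6 and 7-7: token 7 ends up alone in the second region, exactly the behaviour described.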

>> For that reason, TEITOK since this week uses a custom c++ application to directly build the files needed by cwb-makeall from the XML files.
> 
> Does that mean you actually create the binary data files (in uncompressed form) from your application, without going through the appropriate CWB tools?  You shouldn't do that, and I can't think of any good reason for doing it.[*]  One of the obvious consequences is that any file format changes - such as those envisioned for CWB 4 - will completely break your program, and it will be much harder to adapt than if you were using the CWB encoder tools.
> 
> If you create .rng files through the appropriate cwb-s-encode utility, it will stop you from generating overlapping or nested regions.
> 
> [*] Ok, there's one fairly good reason if you're dealing with very large corpora: it may be more efficient to write files directly than to open pipes to a large number of cwb-encode and cwb-s-encode backends.  But I'm really not sure that this makes up for the loss in maintainability and reliability.

Yes - tt-cwb-encode directly writes binary files; I initially wanted to use cwb-atoi (and later cwb-s-encode), but when I opened up the code, I saw the conversion is so trivial that there was simply no need for the overhead (which would also involve making sure the application can be found, etc.). It is a simple function, which can easily be changed into a call to cwb-atoi on a major overhaul, or just implemented slightly differently (a direct copy would not really work since tt-cwb-encode is C++ and not C): 

// Write a 32-bit integer in CWB network (big-endian) byte order
#include <arpa/inet.h>  // htonl
#include <cstdio>       // FILE, fwrite

void write_network_number ( int towrite, FILE *stream ) {
	int i = htonl(towrite);    // convert host to network byte order
	fwrite(&i, 4, 1, stream);  // write the 4 raw bytes
}

The same holds for ranges, although that is of course slightly more complicated. However, most of the work is in finding out which range to write in the first place; the ten lines of 
void write_range ( int pos1, int pos2, string formkey ) 
do not really add to the complexity and can also be modified in the future when needed.
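For illustration, a hedged sketch of what such a write_range might look like on top of write_network_number - assuming the .rng layout of consecutive (start, end) corpus positions; the explicit stream parameter and the omission of the .avs/.avx handling for formkey are my own simplifications, not the actual tt-cwb-encode code:

```cpp
#include <arpa/inet.h>
#include <cstdio>
#include <string>

// As in the snippet above
void write_network_number ( int towrite, FILE *stream ) {
    int i = htonl(towrite);
    fwrite(&i, 4, 1, stream);
}

// Sketch: a .rng file is a sequence of (start, end) corpus positions,
// each a 32-bit network-order integer.  The annotation value (formkey)
// would go to the companion .avs/.avx files, which are omitted here.
void write_range ( int pos1, int pos2, const std::string &formkey,
                   FILE *rng_stream ) {
    (void)formkey;  // value handling not shown in this sketch
    write_network_number(pos1, rng_stream);
    write_network_number(pos2, rng_stream);
}
```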

Also - I would hope that if CWB gets a major overhaul, the implementation of ranges could be rethought as well, which would probably mean that even cwb-s-encode would break. Here is a "suggestion":

Apart from allowing overlaps and/or nestings, the application of sattributes is hampered by the fact that they are so very different from pattributes, which means many of the nice functions on pattributes are not applicable to sattributes (I think even regex is not available for sattributes). In my opinion, the language would become much more expressive by blurring the distinction between p and s, and adopting a notation à la XPath where before the brackets you can indicate the range type (with nothing meaning a token), to allow for queries like

np[case="nominative|ergative"] [pos="V.*"]

and since these are ranges, they can of course be nested:

mwe[type="name" [pos="CC"]]

which seems not only more elegant to me than [pos="CC"] :: mwe_type="name" but also should be more expressive...

The difference with the current search style is not that big (and it should not affect backward compatibility), and since a new file format would require looking up data completely differently anyway, it might be worthwhile to profit from that to treat sattributes more like pattributes. In the current set-up they are already very similar behind the scenes: the lexicon.idx file is largely the same as the .avx file, and the .lexicon mimics the .avs file, the only real difference being that of course .corpus indicates positions and .rng ranges. However, internally they are treated very differently, and there is no range-based variant of .rvs for instance. But from the looks of it, there is little preventing sattributes from being treated mostly like pattributes - and of course, there would be major implications if you tried to implement that in the current CWB, but when making dramatic changes anyway, would it not be possible to look into that?

>>>> - ideally, the CQP tokens would directly point to indexes in the XML files to make it possible to efficiently extract the matching data directly from the XML files. An inelegant method would be to add two pattributes for this, but would there be any more elegant way to link tokens in CQP to ranges in external files?
>> 
>> Not any that I can think of. 
> 
> Nor I.  But that's not surprising, given that XML itself doesn't have an elegant way of linking to external files and is forced to use XPointers or other verbose and horrible concoctions.
> 
> You could store XML IDs of the relevant elements as p-attributes, or byte offsets into the XML files (for better efficiency and flexibility). None of these solutions is efficient in CWB 3 - they'll be much better in CWB 4 with "raw string" and "integer" attribute types.

Keeping the IDs is what TEITOK (and CorpusWiki) have done from the start, and is why results from CQL queries link directly to their result in the XML file; however, when showing long lists of results, it would be very nice to be able to show the initial XML context rather than the CQP output, since CQP does not do spacing, nor does it do typesetting. And every implementation I tried (including writing a dedicated app) still ends up being too slow for internet use: a list of 100 results takes several seconds to load, which is not acceptable. So what I was/am looking for is indeed a way to store byte offsets. But I’ll just either put these in a CQP pattribute then, or in an external index (potentially using the CWB format for coherence).
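As a sketch of the byte-offset idea (purely illustrative; the function name and parameters are my own assumptions, not TEITOK code): once an offset into the XML file is stored per token, the raw XML context can be pulled out with a single seek instead of being reconstructed from CQP output:

```cpp
#include <cstdio>
#include <string>

// Illustrative sketch: given a byte offset stored as a p-attribute,
// read the surrounding raw XML directly from the source file.
// A real implementation would snap the slice to element boundaries
// so it does not cut through a tag.
std::string xml_context ( const char *xmlfile, long offset, long width ) {
    FILE *f = fopen(xmlfile, "rb");
    if (!f) return "";
    long start = offset > width ? offset - width : 0;
    fseek(f, start, SEEK_SET);              // jump straight to the context
    std::string buf(2 * width, '\0');
    size_t got = fread(&buf[0], 1, buf.size(), f);
    buf.resize(got);
    fclose(f);
    return buf;
}
```

Since this is one seek and one read per match, showing 100 results costs 100 small reads rather than 100 CQP context reconstructions, which is where the speed-up would come from.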




More information about the CWB mailing list