[CWB] TEITOK

Thu Nov 19 10:43:06 CET 2015

Hi Maarten,

TEITOK looks like an excellent tool – can we put a link to the server on the CWB homepage?  Also, having a mostly automated TEI converter program would be really useful.

Here are a few small addenda to Andrew's excellent answer:

> On 17 Nov 2015, at 14:29, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
>>> - the technical manual quite explicitly states that structures cannot embed or overlap; however, the logic of .rng files does not seem to invalidate that in any way.
> 
> *Different* attributes can embed and overlap. But instances of one attribute can't embed with, or overlap with, other instances of the same attribute. And yes, it is not the structure of the binary files but rather the way they are used that prevents that.

Well, the unpublished file format specification – which I assume you mean by the "logic of .rng files" – mandates that regions don't nest or overlap: the integer values in a .rng file must form an increasing sequence.  If you violate the file format, bad things will happen (i.e. undefined behaviour of CQP and the other CWB tools).
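To make the constraint concrete, here is a minimal sketch of the check a converter could run on its own region list before writing anything.  It assumes regions are held in memory as (start, end) pairs of corpus positions, and it reads "increasing sequence" as: each region ends at or after its start, and each region starts after the previous one has ended.  The function name check_regions is made up for illustration and is not part of CWB.

```python
def check_regions(regions):
    """Return a list of problems found in a list of (start, end) corpus positions.

    ASSUMPTION: start <= end within a region (a one-token region has
    start == end), and every region must begin after the previous one ends.
    """
    problems = []
    prev_end = -1  # no region seen yet; corpus positions start at 0
    for i, (start, end) in enumerate(regions):
        if end < start:
            problems.append(f"region {i}: end {end} precedes start {start}")
        if start <= prev_end:
            problems.append(
                f"region {i}: start {start} nests in or overlaps "
                f"a previous region ending at {prev_end}"
            )
        # keep the furthest end seen so far, so nesting is caught as well
        prev_end = max(prev_end, end)
    return problems


if __name__ == "__main__":
    print(check_regions([(0, 4), (5, 9)]))   # well-formed
    print(check_regions([(0, 9), (3, 12)]))  # overlapping
```

An empty result means the sequence is acceptable under the assumptions above; anything else should stop the conversion rather than produce a broken attribute.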

> For that reason, TEITOK since this week uses a custom c++ application to directly build the files needed by cwb-makeall from the XML files.

Does that mean you actually create the binary data files (in uncompressed form) from your application, without going through the appropriate CWB tools?  You shouldn't do that, and I can't think of any good reason for doing it.[*]  One of the obvious consequences is that any file format changes, such as those envisioned for CWB 4, will completely break your program, which will then be much harder to adapt than if you were using the CWB encoder tools.

If you create .rng files with the appropriate cwb-s-encode utility, it will stop you from generating overlapping or nested regions.
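As a rough sketch of what going through the tool could look like from a script: the code below formats regions as tab-separated "start, end, value" lines and pipes them to cwb-s-encode.  The option letters (-d for the data directory, -V for an attribute with annotation values) and reading from standard input are how I understand the tool, so please check them against "cwb-s-encode -h" for your CWB version.  The helper names, the data directory and the attribute name are made up for illustration.

```python
import subprocess


def s_encode_input(regions):
    """Format (start, end, value) triples as one tab-separated line each."""
    lines = []
    for start, end, value in regions:
        if "\t" in value or "\n" in value:
            raise ValueError(f"annotation value contains tab or newline: {value!r}")
        lines.append(f"{start}\t{end}\t{value}\n")
    return "".join(lines)


def s_encode_command(data_dir, attribute):
    """Build the command line (ASSUMED flags: -d data directory, -V attribute)."""
    return ["cwb-s-encode", "-d", data_dir, "-V", attribute]


def encode_regions(regions, data_dir, attribute):
    """Pipe the regions to cwb-s-encode; raises if the tool reports an error."""
    subprocess.run(
        s_encode_command(data_dir, attribute),
        input=s_encode_input(regions),
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical regions and attribute name; nothing is executed here.
    print(s_encode_input([(0, 4, "doc1"), (5, 9, "doc2")]), end="")
    print(s_encode_command("/path/to/data", "text_id"))
```

The point of the design is that the converter only produces text and leaves the binary layout to the encoder, so a file format change affects the CWB tools rather than your program.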

[*] Ok, there's one fairly good reason if you're dealing with very large corpora: it may be more efficient to write files directly than to open pipes to a large number of cwb-encode and cwb-s-encode backends.  But I'm really not sure that this makes up for the loss in maintainability and reliability.

>>> - ideally, the CQP tokens would directly point to indexes in the XML files to make it possible to efficiently extract the matching data directly from the XML files. An inelegant method would be to add two pattributes for this, but would there be any more elegant way to link tokens in CQP to ranges in external files?
> 
> Not any that I can think of. 

Nor I.  But that's not surprising, given that XML itself doesn't have an elegant way of linking to external files and is forced to use XPointers or other verbose and horrible concoctions.

You could store XML IDs of the relevant elements as p-attributes, or byte offsets into the XML files (for better efficiency and flexibility).  Neither solution is efficient in CWB 3; both will work much better in CWB 4 with "raw string" and "integer" attribute types.
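A minimal sketch of both variants, producing one tab-separated line per token (word form, XML ID, start byte, end byte) in the one-token-per-line layout that cwb-encode reads.  It assumes tokens are marked up as <tok id="...">form</tok> elements with no nested markup and that the file is UTF-8; the regular expression and the function name are illustrative only, and a real converter would use a proper XML parser and decode entities.

```python
import re

# ASSUMPTION: flat <tok id="...">form</tok> elements, no markup inside tokens.
TOK = re.compile(rb'<tok\b[^>]*\bid="([^"]+)"[^>]*>([^<]*)</tok>')


def vertical_lines(xml_bytes):
    """Yield 'form<TAB>id<TAB>start<TAB>end' for each token in the XML bytes.

    start and end are byte offsets of the whole <tok> element, so that
    xml_bytes[start:end] returns the element exactly as stored on disk.
    """
    for m in TOK.finditer(xml_bytes):
        form = m.group(2).decode("utf-8")
        xml_id = m.group(1).decode("utf-8")
        yield f"{form}\t{xml_id}\t{m.start()}\t{m.end()}"


if __name__ == "__main__":
    sample = b'<s><tok id="w-1">Hi</tok> <tok id="w-2">there</tok></s>'
    for line in vertical_lines(sample):
        print(line)
```

With the offsets stored as p-attributes, a query front end can take the values of the first and last token of a match and read the corresponding slice straight from the XML file.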

Best,
Stefan