[CWB] TEITOK

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Nov 17 14:29:53 CET 2015


>>- a structural attribute like text_id has a .rng file that is always identical to the text.rng, why are the files necessary?

In a Platonic sense, they aren't. The reason they exist is backwards compatibility, since originally, each s-attribute instance could have only one value. The CWB engine does not actually know that text_id is related to text; the labels are arbitrary; thus the range files have to be duplicated. The mechanism of creating subsidiary s-attributes with a "_" to handle XML attributes, so each range could in practice have multiple values, is a retrofit. It won't be used in CWB4.
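
To illustrate the duplication, here is a minimal Python sketch. It assumes that a CWB3 .rng file is a flat sequence of big-endian 32-bit integer pairs (start and end corpus position of each region); that layout is an assumption to check against the technical manual. The helper names and sample ranges are hypothetical.

```python
import struct

def encode_rng(ranges):
    """Pack (start, end) corpus-position pairs as big-endian 32-bit ints.
    Assumed layout of a CWB3 .rng file; verify against the manual."""
    return b"".join(struct.pack(">ii", start, end) for start, end in ranges)

def parse_rng(data):
    """Unpack the assumed .rng layout back into (start, end) pairs."""
    if len(data) % 8 != 0:
        raise ValueError("expected a whole number of 8-byte (start, end) records")
    values = struct.unpack(f">{len(data) // 4}i", data)
    return list(zip(values[0::2], values[1::2]))

# Hypothetical contents of text.rng and text_id.rng for a two-text corpus.
text_rng = encode_rng([(0, 9), (10, 24)])
text_id_rng = encode_rng([(0, 9), (10, 24)])

# The engine treats "text" and "text_id" as unrelated labels, so each
# attribute carries its own copy of the same ranges.
same_ranges = parse_rng(text_rng) == parse_rng(text_id_rng)
print(same_ranges)
```

With real data, the same comparison could be run on the bytes read from the two files in the corpus data directory.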

>>- the technical manual quite explicitly states that structures cannot embed or overlap; however, the logic of .rng files does not seem to invalidate that in any way.

*Different* attributes can embed and overlap. But instances of one attribute can't embed with, or overlap with, other instances of the same attribute. And yes, it is not the structure of the binary files but rather the way they are used that prevents that.
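
As a rough sketch of that constraint, the following Python checks whether the regions of a single attribute are "flat", meaning sorted with no embedding or overlap. The function name and the sample ranges are hypothetical, and the check mirrors the rule as described here, not the actual CWB code.

```python
def is_flat(ranges):
    """Return True if (start, end) regions of ONE attribute are in order
    and never embed in or overlap with each other."""
    prev_end = -1
    for start, end in ranges:
        if end < start:
            return False   # malformed region
        if start <= prev_end:
            return False   # embeds in or overlaps the previous region
        prev_end = end
    return True

# Two *different* attributes may nest: every <s> sits inside a <text>.
text_regions = [(0, 9), (10, 24)]
s_regions = [(0, 4), (5, 9), (10, 24)]
print(is_flat(text_regions), is_flat(s_regions))

# Instances of the *same* attribute may not nest or overlap.
print(is_flat([(0, 9), (2, 5)]))    # embedded region
print(is_flat([(0, 9), (5, 12)]))   # overlapping region
```

Nothing in the pair-of-integers layout itself rules out the last two cases, which is the point made above: the restriction comes from how the files are used.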

>>- the news that CWB 4 might 

Not might. Will.

>>use a completely different architecture (Zyggo) is somewhat disconcerting, since it might break a lot of things.

Not might. Will. 

In fact, it will break a large part of your system as you have described it. Your tt-cwb-encode will *definitely* not work with CWB4, since the file format it generates won't be used (indeed cwb-encode and cwb-makeall as presently constituted won't exist in CWB4...)

>>When is this change planned for (roughly), and how backwards compatible is it intended to be?

Backwards compatibility: (a) in terms of query syntax, almost entirely; (b) in terms of CL calls, largely; (c) in terms of the use of the cwb-utils, to a fair degree; (d) in terms of low-level internal mechanisms and file formats, not at all.

Applications that use CWB3 via the defined interfaces will need some tweaks to work with CWB4 but hopefully nothing too major. Applications that fiddle directly with the binary files, like yours, will break completely and you will need to rewrite them from scratch to work with Ziggurat files (there will be a separate Ziggurat library you can use for this). 

Timing: we hope to start rolling out 3.9 series versions of CWB within the next year (he said, fingers crossed hard). 3.9 will become 4.0 once we're reasonably sure it's feature complete and stable.

All that is your bad news; your good news is that we will maintain 3.5 (the release version of the current sequence of 3.4 development versions) for at least several years. 

>>- ideally, the CQP tokens would directly point to indexes in the XML files to make it possible to efficiently extract the matching data directly from the XML files. An inelegant method would be to add two p-attributes for this, but would there be any more elegant way to link tokens in CQP to ranges in external files?

Not any that I can think of. 
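
For reference, the two-p-attribute approach mentioned in the question could look roughly like the Python sketch below. It assumes tokens are marked up as <tok> elements and that the two extra columns hold byte offsets into the XML file; the element name, the sample document and the helper names are all hypothetical.

```python
import re

# Hypothetical XML source; in practice this would be read from a file as bytes.
xml_bytes = b'<text id="t1"><tok>Hello</tok> <tok>world</tok></text>'

def token_rows(xml):
    """Return (word, start, end) per <tok> element, where start/end are
    byte offsets of the whole element in the XML source."""
    return [
        (m.group(1).decode("utf-8"), m.start(), m.end())
        for m in re.finditer(rb"<tok>(.*?)</tok>", xml)
    ]

def to_vertical(rows):
    """One token per line, tab-separated: word plus two offset columns
    that would be encoded as two extra p-attributes."""
    return "\n".join(f"{word}\t{start}\t{end}" for word, start, end in rows)

def extract(xml, start, end):
    """Recover the original markup for a token from its stored offsets."""
    return xml[start:end]

rows = token_rows(xml_bytes)
print(to_vertical(rows))
print(extract(xml_bytes, rows[1][1], rows[1][2]))
```

A query result would then return the two offset values for each matching token, and the application would slice the XML file with them.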

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Maarten Janssen
Sent: 17 November 2015 12:42
To: cwb at sslmit.unibo.it
Subject: [CWB] TEITOK

