[CWB] TEITOK

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Nov 20 14:14:48 CET 2015


Hi Maarten,

>>>
Yes - tt-cwb-encode directly writes binary files; I initially wanted to use cwb-atoi (and later hence cwb-s-encode), 
<<<

I think there is a misunderstanding here. cwb-s-encode's input format is *textual*, i.e. it takes a text file with two columns of integers-as-text. If you did pipe your data through cwb-atoi, you would not get something that can be used by cwb-s-encode.
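To make the distinction concrete, here is a minimal Python sketch contrasting the two representations. It is an illustration under assumptions, not a description of either tool's internals: the function names are hypothetical, the tab separator between the two columns is assumed, and the packed form (32-bit big-endian integers) is only assumed to resemble what a text-to-binary converter such as cwb-atoi would emit.

```python
import struct


def as_s_encode_text(regions):
    """Render (start, end) pairs as text: one region per line, two columns.

    The tab separator is an assumption made for this sketch.
    """
    return "".join(f"{start}\t{end}\n" for start, end in regions)


def as_packed_binary(regions):
    """Pack the same numbers as 32-bit big-endian integers (binary, not text).

    The byte layout is an assumption made for this sketch.
    """
    flat = [n for pair in regions for n in pair]
    return struct.pack(f">{len(flat)}i", *flat)


# Two hypothetical regions given as corpus positions.
regions = [(0, 4), (5, 11)]

text_input = as_s_encode_text(regions)    # human-readable columns of digits
binary_blob = as_packed_binary(regions)   # raw bytes, unreadable as text columns
```

The first form is the kind of thing a tool expecting textual columns can parse; the second is a stream of raw bytes, which is why piping already-converted binary data into a text-based encoder would not work.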

>>>
Also - I would hope that if CWB gets a major overhaul, the implementation of ranges could be rethought as well, which would probably mean that even cwb-s-encode would break. 
<<<

I suggest you read the Ziggurat proposal. We have already decided how this is going to change. cwb-s-encode is unlikely to exist at all in CWB 4.

See http://cwb.sourceforge.net/cwb4.php

However, new c-attributes ("constituency") will eventually be available to represent more complex XML trees.

>>>
Here is a "suggestion":
<<<

I don't think we intend - at least in the first instance - to make changes to the CQP search syntax anywhere near as radical as what you suggest. The first job will be to get CQP working much as it does at present, but with a Ziggurat database as its backend. After that, we will develop new capabilities to exploit the things that Ziggurat can do and the present format can't.

I also don't see much margin in moving in an XPath-like direction. 

>>>
it might be worthwhile to profit from that to treat sattributes more like pattributes... in the current set-up they are very similar behind the scenes: the lexicon.idx file is largely the same as the .avx file and the .lexicon mimics the .avs file, the only real difference being that of course .corpus indicates positions and .rng ranges. However, internally they are treated very differently, and there is no range-based variant of .rvs for instance. But from the looks of it, there is little preventing sattributes from being treated mostly like pattributes - and of course, there would be major implications when you would try to implement that in the current CWB, but when making dramatic changes anyway, would it not be possible to look into that?
<<<

You *really* should read the Ziggurat specification, esp. its revised treatment of different types of data. In short, s-attributes will be implemented as Ziggurat segmentation layers. Annotations on s-attribute regions will be treated as string variables on a segmentation layer (equivalent to p-attributes being represented as string variables on the base layer).
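As a rough illustration of that parallel, here is a toy Python model. It is not the Ziggurat data model or API: every class, field and variable name below is hypothetical, and the only point is that annotations are stored the same way on both kinds of layer, as one string per unit.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Hypothetical layer: `size` units, each carrying named string variables."""
    name: str
    size: int
    # Maps a variable name to a list holding one string per unit.
    variables: dict = field(default_factory=dict)

    def add_variable(self, var, values):
        """Attach a string variable, requiring exactly one value per unit."""
        if len(values) != self.size:
            raise ValueError("need exactly one value per unit")
        self.variables[var] = list(values)


@dataclass
class SegmentationLayer(Layer):
    """Hypothetical segmentation layer: units are (start, end) ranges
    over base-layer positions."""
    ranges: list = field(default_factory=list)


# Base layer: five tokens with a p-attribute-like string variable.
base = Layer("token", 5)
base.add_variable("pos", ["DT", "NN", "VB", "DT", "NN"])

# Segmentation layer: two regions with an s-attribute-like annotation.
sentences = SegmentationLayer("s", 2, ranges=[(0, 2), (3, 4)])
sentences.add_variable("id", ["s1", "s2"])
```

In this toy version the same `add_variable` call serves both cases; the segmentation layer differs only in that each unit also has a range.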

best

Andrew.



