[CWB] TEITOK

Fri Nov 20 15:17:24 CET 2015

> On 20 Nov 2015, at 14:14, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
> Yes - tt-cwb-encode directly writes binary files; I initially wanted to use cwb-atoi (and later hence cwb-s-encode), 
> <<<
> I think there is a misunderstanding here. cwb-s-encode's input format is *textual*, i.e. it takes a text file with two columns of integers-as-text. If you did pipe your data through cwb-atoi, you would not get something that can be used by cwb-s-encode.

I think that Maarten meant the "later" chronologically, i.e. first he wanted to use cwb-atoi to convert a sequence of integers for the .rng file to CWB encoding; then he considered using cwb-s-encode to create all index files for the s-attribute; but finally he decided to write the binary files directly from his own program.
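
To make the difference between the two formats concrete, here is a minimal Python sketch. It assumes that cwb-s-encode reads one region per line as TAB-separated start and end corpus positions, and that the .rng file is a flat sequence of 32-bit integers in network (big-endian) byte order, start followed by end. Both are assumptions about the CWB 3 formats and should be checked against the documentation; the function names are made up for illustration.

```python
import struct

def ranges_to_text(ranges):
    """Textual form: one region per line, start and end separated by a TAB.
    (Assumed to match what cwb-s-encode expects for a plain s-attribute.)"""
    return "".join(f"{start}\t{end}\n" for start, end in ranges)

def ranges_to_binary(ranges):
    """Binary form: each region as two big-endian 32-bit integers.
    (Assumed layout of the .rng file; verify before relying on it.)"""
    return b"".join(struct.pack(">ii", start, end) for start, end in ranges)

# Three hypothetical regions given as (start, end) corpus positions.
ranges = [(0, 4), (5, 11), (12, 12)]
text = ranges_to_text(ranges)
binary = ranges_to_binary(ranges)
```

The textual form is what a tool like cwb-s-encode would consume; the binary form is the kind of data a program writing index files directly would have to produce itself.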

> Also - I would hope that if CWB gets a major overhaul, the implementation of ranges could be rethought as well, which would probably mean that even cwb-s-encode would break. 
> <<<
> 
> I suggest you read the Ziggurat proposal.

Yes, I was also going to suggest this!

> We have already decided how this is going to change. cwb-s-encode is unlikely to exist at all in CWB 4.

Without having made any concrete plans for CWB4 input formats, I expect that there will be a tool very similar to cwb-s-encode at least at the Ziggurat level, i.e. a tool which allows you to create a new layer with a given set of variables, and with base layer links specified in terms of layer positions.  For a segmentation layer with a single string variable, the input of this tool might look exactly like the input of cwb-s-encode.

(At least that's what I imagine for the first set of encoding tools. More sophisticated encoders to be added at a later stage. :)

> Here is a "suggestion":

> np[case="nominative|ergative"] [pos="V.*"]
> 
> and since these are ranges, they can of course be nested:
> 
> mwe[type="name" [pos="CC"]]
> 
> which seems not only more elegant to me than [pos="CC"] :: mwe_type="name" but also should be more expressive…

I don't see yet how these would allow you to do anything you can't do in CQP syntax.  Your first query should be equivalent to

	<np_case = "nominative|ergative"> []* </np_case> [pos="V.*"]

which will probably change to 

	<np case="nominative|ergative"> []* </np> [pos="V.*"]

in CWB4 and run considerably faster there.

Is there any way in which the second example differs from

	<mwe_type = "name"> []* [pos="CC"] []* </mwe_type>

?  You seem to think that the query should only return the token marked [pos="CC"], but that seems to be a rather bold interpretation to expect from the query engine.

> I don't think we intend - at least in the first instance - to make changes to the CQP search syntax anywhere near as radical as what you suggest. The first job will be to get CQP working much as it does at present, but with a Ziggurat database as its backend. Subsequently to that, we will develop new capabilities to exploit the things that Ziggurat can do that the present format can't.

Any suggestions on how CQP syntax can be extended – without any radical changes – to support the new features of Ziggurat/CWB4 are highly welcome, though.  The query language is very much tied to contiguous sequences of tokens, so it's difficult to fit in hierarchical structures.

Your second example is a good case in point.  CQP syntax doesn't express hierarchical relationships between regions, so queries of this type can be inelegant and tend to be inefficient (because CQP can't easily see the obvious optimization).
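
As a rough illustration of the containment relationship at issue, here is a small Python sketch. It is not how CQP or Ziggurat works internally; the data layout (a list of POS tags indexed by corpus position, and regions as (start, end, type) tuples) and the function name are invented for this example.

```python
def cc_inside_names(pos_tags, regions):
    """Return (token position, enclosing region) for every CC token
    that lies inside an mwe region of type "name"."""
    hits = []
    for start, end, rtype in regions:
        if rtype != "name":
            continue
        # Region boundaries are inclusive corpus positions.
        for cpos in range(start, end + 1):
            if pos_tags[cpos] == "CC":
                hits.append((cpos, (start, end)))
    return hits

# Hypothetical data: "Marks and Spencer opened and closed"
pos_tags = ["NNP", "CC", "NNP", "VBD", "CC", "VBD"]
regions = [(0, 2, "name")]
hits = cc_inside_names(pos_tags, regions)
```

Each hit carries both the token and its enclosing region, so the caller can choose to report only the token (the reading of the nested query suggested above) or the whole region (what the CQP query with []* on both sides matches).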

Best,
Stefan