[CWB] TEITOK

Mon Nov 23 11:28:37 CET 2015

Hi Stefan,

I set up the program that builds CQP binaries from TEI as a public repository on GitLab: https://gitlab.com/maartenes/TT-CWB

It has not been extensively tested, but it correctly builds a CQP corpus from a variety of TEI documents I checked. The repository comes with two examples: one using the TEITOK format, which needs very little configuration, and one using a more standard TEI <w> notation. The latter example should also show how tt-cwb-encode can be used to build a corpus out of almost any XML-encoded, tokenized and annotated corpus, where the settings indicate what counts as a token in the XML and where the relevant annotations can be found.

Now as for searches on sattributes:

> I don't see yet how these would allow you to do anything you can't do in CQP syntax.  Your first query should be equivalent to
> 
> 	<np_case = "nominative|ergative"> []* </np_case> [pos="V.*"]
> 
> which will probably change to 
> 
> 	<np case="nominative|ergative"> []* </np> [pos="V.*"]
> 
> in CWB4 and run considerably faster there.
> 
> Is there any way in which the second example differs from
> 
> 	<mwe_type = "name"> []* [pos="CC"] []* </mwe_type>
> 
> ?  You seem to think that the query should only return the token marked [pos="CC"], but that seems to be a rather bold interpretation to expect from the query engine.

This is largely my fault, since I somehow ended up always using sattributes in :: notation, where things like regexes are not possible. CQL just has so many options that it is hard to keep a good overview, so the point below might be missing some things as well, in which case my apologies. And no, the last example was not intended to capture only the CC: although you can in principle do that in XPath by using tok[pos="CC" and ancestor::mwe[type="name"]], that complicates the language considerably and does not seem to add much to CQL, given that you can always select a target node in CQL by using @[pos="CC"].

As long as you can say <mwe type="name" and subtype="geographical">, there is not much difference between these two notations, the main one being that CQL always matches the whole segment unless told otherwise.

However, the use of XML tags in this way seems to make it more difficult to overcome the limitation that ranges cannot overlap or intersect. Take the following NP:

<np function="sub">the man with <np>the binoculars</np> on <np>his head</np></np>

Now I don't think that the current set-up would allow you to modify the use of sattributes in such a way that the following query would hold, since there is an intervening </np> between the beginning of the <np> and the word “head”, and the logic of checking for opening and closing NP ranges does not seem very compatible with it:

[pos="V.*"] <np>[]* [word="head"] []*</np>

This is because in the current CQL notation there is no explicit statement that the </np> has to close the SAME range that was opened earlier. Unless interpreted as XML, there is no link between the two tags, and as XML it would of course not allow for overlapping ranges. So in my opinion, the notation makes it more difficult to overcome the current limitation that ranges cannot overlap or intersect, which is of course not an issue as long as there is a way around it.

In contrast, the XPath-like notation (which could of course use angular brackets, that is not the issue: <np function="sub" and [word="head"]> ) would make it explicit that “head” has to appear inside a range of type “np”, without having to check in any way what else happens inside that range, or what that range intersects or overlaps with. I was going to write a tiny CQL fragment that would do exactly that, but then ran into some complications which made me give up on it, at least for now, since it would not serve much of a purpose other than showing that it can be done.
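
Just to illustrate the containment idea, here is a minimal Python sketch. It is not CQL and not code from TT-CWB; the Range class, the ranges_containing function and the inclusive token offsets are all made up for this example. It treats each range as an independent (start, end) span and only asks whether a matching token falls inside it, so it never has to pair opening and closing tags.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    """Hypothetical stand-in for an sattribute range (not a CWB structure)."""
    kind: str
    start: int  # index of the first token in the range
    end: int    # index of the last token in the range (inclusive)
    attrs: dict = field(default_factory=dict)


def ranges_containing(ranges, kind, attrs, tokens, token_test):
    """Return the ranges of type `kind` whose attributes match `attrs`
    and that contain at least one token passing `token_test`.
    Every range is checked on its own, so nesting or overlap with
    other ranges plays no role."""
    hits = []
    for r in ranges:
        if r.kind != kind:
            continue
        if any(r.attrs.get(k) != v for k, v in attrs.items()):
            continue
        if any(token_test(tokens[i]) for i in range(r.start, r.end + 1)):
            hits.append(r)
    return hits


# The NP from the example above, as tokens plus three np ranges.
tokens = [{"word": w} for w in "the man with the binoculars on his head".split()]
ranges = [
    Range("np", 0, 7, {"function": "sub"}),  # the man ... his head
    Range("np", 3, 4),                       # the binoculars
    Range("np", 6, 7),                       # his head
]

# Rough counterpart of <np function="sub" and [word="head"]>
hits = ranges_containing(
    ranges, "np", {"function": "sub"}, tokens,
    lambda t: t["word"] == "head",
)
print([(r.start, r.end) for r in hits])  # [(0, 7)]
```

The inner range around “the binoculars” is simply never consulted when checking the outer range, which is the whole point: whether ranges nest, overlap or intersect does not matter for this kind of test.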