[CWB] TT-CQP

Maarten Janssen maartenpt at gmail.com
Thu Feb 1 11:03:42 CET 2018


As promised a while back, I have released a custom version of CQP called tt-cqp. It is built from scratch in C++ and should be fully UTF compliant, but was designed explicitly to read any existing (non-compressed) CQP corpus. It is probably not yet bug-free, and more features are likely to be added in the near future, but you are welcome to try it out here:

https://gitlab.com/maartenes/TT-CWB/blob/master/TT-CQP.md 

I built it because there were too many things I needed for TEITOK that CQP does not provide (at least not to my knowledge, but CQP does a lot, so I might have missed things), such as (partial) support for overlapping sattribute regions and sorting results on sattributes. While I was at it, I also included some additional bells and whistles that might be useful for CWB as well. I know CWB treats sattributes differently, which makes several of those difficult to implement, but some of the other tricks might be an idea for CWB 4. The easiest to implement is probably "substring": match.substr(pos,0,1) takes the first letter of match.pos, which allows you to group results by their main POS tag (in a position-based tagset) without having an explicit pattribute for it.
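
To illustrate the substring idea outside of any query engine, here is a minimal Python sketch. It is not tt-cqp code and does not reproduce its query syntax: the sample words and tags are invented, the helper names substr and group_by_main_pos are hypothetical, and it assumes the two numeric arguments mean start offset and length.

```python
from collections import Counter

# Hypothetical query matches as (word, POS tag) pairs; the tags follow a
# position-based tagset in which the first character is the main category.
matches = [
    ("casa", "NCFS"),
    ("come", "VMIP3S"),
    ("perro", "NCMS"),
    ("rojo", "AQMS"),
    ("corre", "VMIP3S"),
]


def substr(value, start, length):
    """Return `length` characters of `value` starting at offset `start`.

    Assumption: this mirrors the (start, length) reading of substr(pos,0,1).
    """
    return value[start:start + length]


def group_by_main_pos(rows):
    """Count matches by the first character of their POS tag."""
    return Counter(substr(pos, 0, 1) for _, pos in rows)


counts = group_by_main_pos(matches)
for main_pos, n in sorted(counts.items()):
    print(main_pos, n)
```

The point is only that the grouping key is derived on the fly from the full tag, so the corpus does not need a separate attribute holding just the main category.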

As it says in the description, it is in no way meant as a replacement for CQP. It provides a way to use CQP corpora differently, focussing primarily on the things that happen to be needed for the TEITOK framework. Since it was built for TEITOK, which is typically used for small corpora, speed was not a major concern in this implementation.

Any feedback is most welcome.

