[CWB] Tiger-CWB

Maarten Janssen maartenpt at gmail.com
Mon Jul 3 00:36:45 CEST 2017


As a follow-up to an earlier message of mine, I decided to just build a small module that allows dependency queries on a CWB corpus, and made it available as a git repository:

https://gitlab.com/maartenes/TIGER-CWB

It is (still) just a quick set-up: it is not terribly efficient, gets slow on large corpora, and relies on CWB::CL. But it does allow you to quickly get (for example) the most frequent subjects of a verb, as in this example:

a:[lemma="considerar"]; # Search for all occurrences of the lemma considerar
b:[pos="NOUN"]; # Search for nouns in the sentences that had considerar in them
a >nsubj b; # Check that the noun is the nominal subject of the verb
matchlist; # Print out all the matching sentences with the matches for a and b tagged with XML tags
freq(b.lemma); # Print out a frequency breakdown of the lemma of b - i.e. print the most frequent subjects of the verb considerar

The module has a command-line interface similar to that of CWB itself, with -c and -r options.

perl tiger-sh.pl -D MYCORPUS
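
To make the example query above concrete, here is a minimal Python sketch of what it computes, run on a handful of invented sentences. It is not part of TIGER-CWB: the toy data, the tuple layout and the function name subject_lemmas are assumptions made for illustration only, and the module itself reads the corpus through CWB::CL rather than from an in-memory list.

```python
from collections import Counter

# Toy dependency-annotated sentences (invented for illustration).
# Each token: (id, form, lemma, pos, head_id, deprel); head_id 0 marks the root.
SENTENCES = [
    [
        (1, "O", "o", "DET", 2, "det"),
        (2, "governo", "governo", "NOUN", 3, "nsubj"),
        (3, "considera", "considerar", "VERB", 0, "root"),
        (4, "a", "o", "DET", 5, "det"),
        (5, "proposta", "proposta", "NOUN", 3, "obj"),
    ],
    [
        (1, "Os", "o", "DET", 2, "det"),
        (2, "governos", "governo", "NOUN", 3, "nsubj"),
        (3, "consideraram", "considerar", "VERB", 0, "root"),
        (4, "isso", "isso", "PRON", 3, "obj"),
    ],
    [
        (1, "A", "o", "DET", 2, "det"),
        (2, "comissão", "comissão", "NOUN", 3, "nsubj"),
        (3, "considerou", "considerar", "VERB", 0, "root"),
        (4, "o", "o", "DET", 5, "det"),
        (5, "caso", "caso", "NOUN", 3, "obj"),
    ],
]


def subject_lemmas(sentences, verb_lemma):
    """Count lemmas of nouns that are the nsubj of a token with the given lemma."""
    counts = Counter()
    for sent in sentences:
        # a:[lemma="..."] - ids of all tokens matching the verb lemma
        a_ids = {tok[0] for tok in sent if tok[2] == verb_lemma}
        for tok in sent:
            # b:[pos="NOUN"] combined with a >nsubj b
            if tok[3] == "NOUN" and tok[5] == "nsubj" and tok[4] in a_ids:
                counts[tok[2]] += 1  # freq(b.lemma)
    return counts


freq = subject_lemmas(SENTENCES, "considerar")
for lemma, n in freq.most_common():
    print(lemma, n)
```

On this toy data the object nouns (proposta, caso) are excluded because their relation to the verb is obj rather than nsubj.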

Any comments or suggestions are most welcome at this point, even the suggestion to just use an existing platform that already does something similar. I have not seen any similar tools out there, though, and this does provide a quick way to get more out of a CWB corpus that contains dependency relations.

Current syntax (there are still some restrictions, but most of it works in principle):

MYCORPUS - choose a corpus
a:[pos="NOUN"] - add a new token variable (with CQL restrictions)
a > b - define a relation between two variables; options: head (>), followed by (.), and sibling ($), which can be negated (!.) or starred (.*)
text_year = "1900" - define s-attribute restrictions on the sentences
freq(b.lemma) - show frequency breakdown of a p-attribute on one of the token variables
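
As a rough illustration of the relation operators listed above, the following Python sketch spells out one possible reading of them on a single invented sentence. The exact semantics are an assumption on my part, borrowed from TIGERSearch-style conventions (head = direct governor, . = immediately followed by, .* = followed at any distance, $ = same head); the helper names are hypothetical and do not come from the module.

```python
# One toy sentence: token id is its position, head is the id of its governor.
TOKENS = {
    1: {"id": 1, "form": "O", "head": 2},
    2: {"id": 2, "form": "governo", "head": 3},
    3: {"id": 3, "form": "considera", "head": 0},
    4: {"id": 4, "form": "a", "head": 5},
    5: {"id": 5, "form": "proposta", "head": 3},
}


def is_head(a, b):
    """a > b (assumed reading): a is the direct head of b."""
    return b["head"] == a["id"]


def followed_by(a, b):
    """a . b (assumed reading): b is the token immediately after a."""
    return b["id"] == a["id"] + 1


def precedes(a, b):
    """a .* b (assumed reading): b occurs anywhere after a in the sentence."""
    return a["id"] < b["id"]


def siblings(a, b):
    """a $ b (assumed reading): distinct tokens sharing the same head."""
    return a["id"] != b["id"] and a["head"] == b["head"]


det, noun, verb, obj = TOKENS[1], TOKENS[2], TOKENS[3], TOKENS[5]

print(is_head(verb, noun))          # verb governs the subject noun
print(followed_by(det, noun))       # determiner directly before the noun
print(precedes(det, verb))          # starred variant: any distance
print(siblings(noun, obj))          # both depend on the verb
print(not followed_by(noun, obj))   # negated variant, as in !.
```

A negated relation is modelled here simply as the logical negation of the plain one; whether the module treats negation that way is not something I can confirm from the message.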

I have a local version of the Universal Dependency Treebank converted to TEITOK - CWB that I will make available after getting rid of some more bugs, which will allow people to check queries of the type listed above.

