[CWB] Impossible query

Stefan Evert stefanML at collocations.de
Sat Feb 27 09:43:08 CET 2016


Just a small addition to Andrew's answer:

> Here comes my problem: (a) what can I do if I want the random sample to contain a maximum number of 4 occurrences for each lemma?

(i) Then you're dealing with some form of stratified sampling, which is far too complex to be implemented in CQP.  You can simplify your external scripts by sorting the concordance by the lemma in question, so that the 4 items for each lemma can be sampled from a contiguous chunk of lines.

(ii) Are you really sure you want to do that?  What you get isn't a random sample in any sense that would allow you to draw statistical inferences.
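
If you do go down that route, the per-lemma sampling from (i) could look roughly like the sketch below.  It is plain Python rather than anything CWB-specific, and it assumes the concordance has already been exported and sorted by lemma, so that each line is available as a (lemma, line) pair; the function name and the input format are my own assumptions for illustration.

```python
import random
from itertools import groupby


def sample_per_lemma(rows, max_per_lemma=4, seed=None):
    """Draw at most `max_per_lemma` lines for each lemma.

    `rows` is an iterable of (lemma, line) pairs that is already sorted
    by lemma, so all lines for one lemma form a contiguous chunk.
    (Hypothetical input format; adapt to your exported concordance.)
    """
    rng = random.Random(seed)
    sample = []
    # groupby only merges adjacent rows, hence the need for sorted input
    for lemma, group in groupby(rows, key=lambda row: row[0]):
        chunk = list(group)
        k = min(max_per_lemma, len(chunk))
        # simple random sample without replacement within the chunk
        sample.extend(rng.sample(chunk, k))
    return sample
```

Lemmas with 4 or fewer hits are kept in full, the others are cut down to 4 random hits each.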

> (b) what if I want the random sample to contain a maximum of 4 occurrences of any translation in column 3?

That's even harder because CWB treats each set as a single string, so you can't even count the individual items (one of the many things we're planning to improve in CWB 4).

BTW, the Perl/CWB API has support functions for splitting sets, so it might be easiest to do something like this with a Perl script.
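
To illustrate the idea without relying on the Perl API, here is a rough sketch in plain Python.  It assumes the set values are exported in the usual CWB feature-set notation (items delimited by "|", e.g. "|haus|heim|", with "|" for the empty set), and it reads "a maximum of 4 occurrences of any translation" as: a line is only kept if none of its translations has already been selected 4 times.  That reading, the function names and the (set_string, line) input format are assumptions on my part.

```python
import random
from collections import Counter


def split_set(value):
    """Split a feature-set string such as '|haus|heim|' into its items."""
    return [item for item in value.strip("|").split("|") if item]


def sample_capped(rows, max_per_item=4, seed=None):
    """Keep lines in random order while no translation exceeds the cap.

    `rows` is a list of (set_string, line) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    counts = Counter()
    kept = []
    for value, line in shuffled:
        items = set(split_set(value))
        # accept the line only if every one of its items is below the cap
        if all(counts[item] < max_per_item for item in items):
            kept.append((value, line))
            counts.update(items)
    return kept
```

Note that this is a greedy selection: which lines survive depends on the shuffle order, so the caveat from (ii) applies here as well.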

Best,
Stefan.


