[CWB] How to know how many texts do not contain a match of a macro

Sun Jun 2 20:15:26 CEST 2013

On 28 May 2013, at 19:11, Jose Manuel Martinez Martinez <jmmtra at gmail.com> wrote:

> So far so good. But in this way I only get the list of texts (and the number of instances) where "man" was found. But I would like to obtain a list including also the texts that didn't contain any instance of "man".
> 
> Something like this:
> 
>  Frankfurt_1789_9      20
>  Danckwerth_1729_22    10
>  Danckwerth_1729_21     7
>  Deckhardt_1611_20      7
>  Frankfurt_1789_10      7
>  Stockholm_1647_9       7
>  Knopf_1800_2           0
>  Graz_1686_8            0
>  Wecker_1679_8          0
>  Danckwerth_1729_20     0
> 
> Is it possible with CQP?

I'm afraid not, and AFAIK most Web interfaces don't offer such an option either.

You'll need a small post-processing script that has a list of all the texts, so it can automatically insert the zero frequency counts.
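
Such a script might look roughly like the following Python sketch. It assumes that the frequency table has been saved with one text ID and one count per line, separated by whitespace, and that the complete list of text IDs is available from somewhere else; the function name fill_zero_counts is made up for illustration and is not part of CWB.

```python
def fill_zero_counts(group_lines, all_text_ids):
    """Return (text_id, count) pairs for every text, inserting 0 where
    the frequency table has no entry.

    group_lines  -- lines of the saved frequency table, assumed to look
                    like "text_id<whitespace>count" (an assumption about
                    the output format, check it against your own files)
    all_text_ids -- complete list of text IDs in the corpus
    """
    counts = {}
    for line in group_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last run of whitespace: ID on the left, count on the right
        text_id, freq = line.rsplit(None, 1)
        counts[text_id] = int(freq)

    # every known text gets an entry; texts without matches get 0
    table = [(text_id, counts.get(text_id, 0)) for text_id in all_text_ids]
    # sort by descending frequency, then by text ID for a stable order
    table.sort(key=lambda pair: (-pair[1], pair[0]))
    return table


if __name__ == "__main__":
    observed = ["Frankfurt_1789_9\t20", "Danckwerth_1729_22\t10"]
    all_ids = ["Danckwerth_1729_22", "Frankfurt_1789_9", "Knopf_1800_2"]
    for text_id, freq in fill_zero_counts(observed, all_ids):
        print(f"{text_id}\t{freq}")
```

Sorting by descending count puts the zero-frequency texts at the end of the list, as in the table you sketched.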

If you know that your query never matches at the start of a text (i.e. as the text's first token), you could use the following trick:

	A = /man[];
	B = <text> [];
	C = union A B;
	group C match text_id;

This adds one fake match for every text (from query result B), so all frequency counts come out too large by one, but it should be easy enough to adjust them by subtracting one. However, matches of A at the start of a text will be collapsed with the corresponding match from B, leading to incorrect frequency counts for those texts.
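
The adjustment could be done along these lines. This is only a rough Python sketch, assuming the output of the group command has been saved with one text ID and one count per line, separated by whitespace; the function name remove_fake_matches is made up for illustration. It cannot repair the counts for texts in which A matched the first token.

```python
def remove_fake_matches(group_lines):
    """Subtract the one fake match per text that was added by the union
    with B, and return (text_id, count) pairs.

    group_lines -- lines of the saved frequency table, assumed to look
                   like "text_id<whitespace>count" (an assumption about
                   the output format)
    """
    adjusted = []
    for line in group_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last run of whitespace: ID on the left, count on the right
        text_id, freq = line.rsplit(None, 1)
        # every text carries exactly one extra match from B
        adjusted.append((text_id, int(freq) - 1))

    # sort by descending frequency, then by text ID for a stable order
    adjusted.sort(key=lambda pair: (-pair[1], pair[0]))
    return adjusted


if __name__ == "__main__":
    saved = ["Frankfurt_1789_9\t21", "Knopf_1800_2\t1"]
    for text_id, freq in remove_fake_matches(saved):
        print(f"{text_id}\t{freq}")
```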

Best,
Stefan