[CWB] Finding paragraphs with a minimum number of hits

Thu May 24 22:04:50 CEST 2018

> In CQP, is there a way to find paragraphs <p> with a minimum number of hits for a certain pattern, say min. 3 hits for [lemma="nach"] [pos="ADJA"] [pos="NN"]?
> 
> I could do
> 
> [lemma="nach"] [pos="ADJA"] [pos="NN"] []* [lemma="nach"] [pos="ADJA"] [pos="NN"] []* [lemma="nach"] [pos="ADJA"] [pos="NN"] within p
> 
> but this seems to be rather inefficient for large corpora.

There's no straightforward solution except for the rather inefficient query you sketched above.  If you want to obtain paragraph counts for query hits, the first thing you need to do is annotate <p> regions with unique IDs (say, in the s-attribute p_id).

Then you can either tabulate matches with paragraph IDs:

	A = [lemma="nach"] [pos="ADJA"] [pos="NN"];
	tabulate A match p_id, match .. matchend lemma > "results.txt";

and use an external script to group the rows by paragraph and sort the groups by the number of matches per paragraph, as needed.  You may be able to do this directly with a Unix pipe instead of redirecting the output to a file.
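A minimal sketch of such an external script, in Python.  It assumes the tabulate output has one match per line, with the paragraph ID in the first tab-separated column (as produced by the tabulate command above); the function names are made up for illustration.

```python
from collections import Counter


def count_hits_per_paragraph(lines):
    """Count matches per paragraph ID.

    Each line is expected to look like 'p_id<TAB>lemma lemma lemma'
    (an assumption about the tabulate output format).
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        p_id = line.split("\t", 1)[0]
        counts[p_id] += 1
    return counts


def paragraphs_with_min_hits(counts, min_hits=3):
    """Return (p_id, count) pairs with at least min_hits matches, most frequent first."""
    return [(p_id, n) for p_id, n in counts.most_common() if n >= min_hits]
```

To use it, open "results.txt", pass the file object to count_hits_per_paragraph(), and print the pairs returned by paragraphs_with_min_hits().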

If the paragraph counts are sufficient, you can obtain them directly:

	group A match p_id;

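but there is currently no really easy way to restrict matches to these paragraphs.  To identify paragraphs with at least 3 matches, you have to write the IDs of those paragraphs to a file (the "cut 3" clause drops groups with fewer than 3 matches), load the file into a variable, and then re-run the query restricted to those IDs: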

	set PrettyPrint off;
	group A match p_id cut 3 > "| cut -f1 > /tmp/para.txt";
	define $para3 < "/tmp/para.txt";
	B = [lemma="nach"] [pos="ADJA"] [pos="NN"] :: match.p_id = RE($para3);
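If you would rather do this restriction outside CQP, the same filtering can be applied to the tabulate output.  The following Python sketch again assumes one match per line with the paragraph ID in the first tab-separated column; the function name is made up for illustration.

```python
from collections import defaultdict


def matches_in_dense_paragraphs(lines, min_hits=3):
    """Keep only the matches whose paragraph contains at least min_hits matches.

    Returns a dict mapping each qualifying paragraph ID to the list of its
    matches (the text after the first tab on each line).
    """
    by_paragraph = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        p_id, _, match_text = line.partition("\t")
        by_paragraph[p_id].append(match_text)
    return {p_id: hits for p_id, hits in by_paragraph.items() if len(hits) >= min_hits}
```

The result holds the same matches that query B selects, grouped by paragraph rather than in corpus order.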

Hope this helps,
Stefan