[CWB] Querying text length

Thu Jun 1 09:56:17 CEST 2017

> On 29 May 2017, at 15:19, Meier-Vieracker, Simon <simon.meier at tu-berlin.de> wrote:
> 
> is there a way to find the shortest and longest text in a CWB-corpus? Something like <text>[]*</text> and then ordering the results by number of tokens?

Not within CQP, but if you're working in a proper Unix-style environment, you can do it from the command line with awk, sort, etc.  For example, the following shows the ten longest texts:

	cwb-s-decode BNC -S text_id | awk -F"\t" '{print $2 - $1 + 1, $3}' | sort -nr | head -10

Here $2 - $1 + 1 (= text end - text start + 1) is the length of a given text in tokens.  Changing "sort -nr" to "sort -n" will give you the ten shortest texts instead.
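
If you want to check the awk arithmetic without a corpus at hand, here is a minimal sketch that feeds it a few made-up lines in the same three-column, tab-separated layout (start cpos, end cpos, text ID).  The sample regions and the function names sample and text_lengths are purely illustrative and not part of CWB.

```shell
# Hypothetical sample regions: start cpos, end cpos, text ID (tab-separated).
# These values are invented for illustration, not taken from the BNC.
sample() {
	printf '0\t9\tA00\n10\t12\tA01\n13\t112\tA02\n'
}

# Length in tokens = end - start + 1, printed together with the text ID.
text_lengths() {
	awk -F'\t' '{print $2 - $1 + 1, $3}'
}

# Two longest sample texts, longest first.
sample | text_lengths | sort -nr | head -2
```

With a real corpus you would replace "sample" with the cwb-s-decode call from the command above.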

If you don't have text IDs (or titles, or some other unique identifiers), you could simply print the start cpos in place of the ID and use it to locate the respective text in CQP (though that's rather inefficient).

	cwb-s-decode BNC -S text | awk -F"\t" '{print $2 - $1 + 1, $1}' | sort -n | head -10

The first line of the output shows that the shortest text in the BNC starts at cpos 70473801.  In CQP you can then do

	BNC;
	<text> [ _ = 70473801 ] expand to text;

to select the entire text.  (You may know that _ is a reference to the current token, which is normally used to access s-attribute values. However, when used in bare form it evaluates to the corpus position as an integer value.)

The same tricks can be achieved within a CQP session using pipes, e.g.

	BNC;
	Texts = <text> [] expand to text;
	tabulate Texts match, matchend, match text_id > "| awk -F'\t' '{print $2 - $1 + 1, $3}' | sort -n | head -10";
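
If you only need the single shortest and longest text, the sort step can be replaced by one awk pass that keeps a running minimum and maximum.  This is just a sketch of that alternative, again run on invented sample lines in the start/end/ID layout; the names sample and shortest_and_longest are hypothetical.

```shell
# Hypothetical sample regions: start cpos, end cpos, text ID (tab-separated).
sample() {
	printf '0\t9\tA00\n10\t12\tA01\n13\t112\tA02\n'
}

# One pass over the regions, tracking the shortest and longest text seen so far.
# On ties, the first text encountered is kept.
shortest_and_longest() {
	awk -F'\t' '
		{ len = $2 - $1 + 1 }
		NR == 1 || len < min { min = len; min_id = $3 }
		NR == 1 || len > max { max = len; max_id = $3 }
		END {
			if (NR > 0) {
				print "shortest", min, min_id
				print "longest", max, max_id
			}
		}
	'
}

sample | shortest_and_longest
```

The same awk program should work on the output of cwb-s-decode or as the target of a tabulate pipe, provided the columns arrive in that order.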

Best,
Stefan