[CWB] Basic sort on structural attribute

Stefan Evert stefanML at collocations.de
Fri Oct 20 19:40:01 CEST 2017


> (BTW, one obvious hack to solve this problem would be to encode the same corpus several times, ordering the input data in different ways. Maybe that IS the most efficient solution in terms of time). 

It would usually be more efficient to encode the relevant information as p-attributes, which simply have the same value for all tokens within the same s-attribute region.  In earlier versions, this was required for frequency counts with "group", too (and recommended in the CQP tutorial).

However, you can specify only a single sort key with "sort", which will probably not be enough for most use cases.

> I understand. About the longer explanation, I hesitate to make you spend the time laying this out for me - maybe there is some document out there? Surely this is not the first time this has cropped up.

I don't recall anybody asking for this option.  I suppose most people are more interested in frequency counts of metadata and would then take subsets of the query result for each metadata category.

A request we did have is to support multiple sort keys, but that would require substantial refactoring.  It is something we might address in CWB 4.

> Otherwise - yes, I would love to hear that! 

I'll give an example with the DICKENS demo corpus:

	DICKENS;
	set PrintStructures "novel_title, chapter_num";
	C = "coffee";
	tabulate C match, matchend, match novel_title, match chapter_num > '| sort -t " " -k 3,3 -k 4,4n -k 1n | cut -f 1-2 > /tmp/sorted.txt';
	undump CS < "/tmp/sorted.txt";
	cat CS;

Note that -t " " must have a literal TAB character between the double quotes, since sort doesn't understand \t and CQP doesn't interpolate it in this case either.

(Things are less hacky if you run a separate shell script for the sorting procedure, of course.)
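As a rough sketch of what such a separate script could look like: the function below wraps the same sort/cut pipeline as in the example above. The name sort_matches is made up for illustration and is not part of CWB; the column layout (match, matchend, novel_title, chapter_num) is assumed to be the one produced by the tabulate command. Forcing LC_ALL=C is my own assumption to make the title ordering reproducible, and can be dropped if locale-aware collation is preferred.

```shell
#!/bin/sh
# sort_matches: hypothetical helper, not part of CWB.
# Reads TAB-separated rows (match, matchend, novel_title, chapter_num)
# on stdin and prints the match/matchend pairs, ordered by
# title, then chapter number (numeric), then corpus position (numeric).
sort_matches() {
  # Build a literal TAB in the shell, so none has to be typed inside CQP.
  tab="$(printf '\t')"
  # LC_ALL=C is an assumption: plain byte order for the title key.
  LC_ALL=C sort -t "$tab" -k 3,3 -k 4,4n -k 1,1n | cut -f 1-2
}

# When saved as a standalone script, the last line would simply be:
# sort_matches
```

With the final call uncommented and the file made executable, the pipe in the tabulate command only needs to name the script and redirect its output, and the literal TAB no longer has to be entered within the CQP session.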

Best,
Stefan

