[CWB] Make 'cut' treat ranges like 'cat'?

Stefan Evert stefanML at collocations.de
Sun Nov 12 16:02:23 CET 2017


> > cut Last 5 15
> CQP Error:
>         Invalid range end 15 for cut operator (only 10 matches).

Oh, I wasn't aware that this variant of the cut command exists at all.  Where did you find it?

> Would it be possible to make 'cut' treat ranges the same way as 'cat', i.e. not raising errors but instead keeping the matches it can given the range?

Makes sense to me. The error is intended to catch mistakes made by the user (or client application), but the behaviour of "cut" is so involved that most actual errors won't be caught anyway.

Before we make any changes, we need to figure out how "cut" should behave, though.  At the moment, there are three different forms of "cut" (in addition to the "cut" appended directly to a query):

a) cut Res <N>; 

Reduces Res to the first <N> matches, except when it doesn't:
 - If Res has fewer than <N> hits, an error is thrown (as in the case Martin complained about).
 - If <N> = 0, Res remains unchanged (i.e. it is cut to _all_ hits).
 - If <N> = 0 and Res is empty, it remains unchanged but there is a warning.

b) cut Res <A> <B>;

Reduces Res to hits <A> through <B> (the usual 0-based indices), except:
 - If <B> is larger than the last valid index, an error is thrown ("invalid range end").
 - If <A> is larger than the last valid index (or in general larger than <B>), Res becomes empty (with a warning).

c) cut Res <A> <B>;

where <A> or <B> is a negative number.  The negative number refers to an index from the end of the query result, i.e. -1 to the last hit, -2 to the second-to-last, etc.  Negative and positive offsets can be mixed, which gives us special cases such as:
 - cut Res 0 -1; leaves Res unchanged (and corresponds to cut Res 0; above)
 - If <A> is negative but its magnitude exceeds the query size, an error is thrown ("invalid range start").
 - If <B> is negative but its magnitude exceeds the query size, Res becomes empty (with a warning).

As you can see, it's a horrible mess.
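The three forms described above can be modelled roughly as follows. This is a Python sketch of the behaviour as listed in this message, not the CQP source code: the names `CutError`, `cut_first` and `cut_range`, the message strings, and the order in which the checks are applied are all assumptions made for illustration.

```python
class CutError(Exception):
    """Stands in for a CQP error that aborts the command (hypothetical name)."""


def cut_first(hits, n):
    """Form a): cut Res <N>;  Returns (new_hits, warnings)."""
    if n == 0:
        # <N> = 0 leaves the result unchanged; an empty result only warns.
        warnings = [] if hits else ["result is empty"]  # wording assumed
        return list(hits), warnings
    if len(hits) < n:
        raise CutError(f"only {len(hits)} matches")  # wording assumed
    return list(hits[:n]), []


def cut_range(hits, a, b):
    """Forms b) and c): cut Res <A> <B>;  Negative values count from the end."""
    size = len(hits)
    if a < 0:
        if -a > size:
            raise CutError("invalid range start")
        a += size  # -1 becomes the last valid index
    if b < 0:
        if -b > size:
            return [], ["result is empty"]  # warning only, no error
        b += size
    # Assumed order: the range end is checked before the range start.
    if b > size - 1:
        raise CutError("invalid range end")
    if a > size - 1 or a > b:
        return [], ["result is empty"]
    return list(hits[a:b + 1]), []
```

With a 10-hit result, `cut_range(hits, 5, 15)` raises the "invalid range end" error from the quoted example, while `cut_range(hits, 12, 9)` only warns and returns an empty result. That asymmetry between start and end is what makes the current behaviour hard to predict.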


My suggestion would be to

1) disallow negative values for <A> and <B> as indices from the end – or does anybody actually use them?

2) clamp the specified range to the query size, possibly issuing a warning if start or end is out of range
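A minimal Python sketch of what the two suggestions could look like together. The function name `cut_clamped`, the warning texts, and the choice to reject negative indices with a `ValueError` are assumptions for illustration only; nothing here reflects an existing CQP implementation.

```python
def cut_clamped(hits, start, end):
    """Proposed behaviour for: cut Res <A> <B>;  Returns (new_hits, warnings)."""
    # Suggestion 1: negative indices are no longer accepted.
    if start < 0 or end < 0:
        raise ValueError("negative indices are not allowed")

    warnings = []
    last = len(hits) - 1  # last valid 0-based index (-1 for an empty result)

    # Suggestion 2: clamp the range to the query size instead of failing.
    if end > last:
        warnings.append(f"range end {end} is beyond the last match")
        end = last
    if start > last:
        warnings.append(f"range start {start} is beyond the last match")

    # An out-of-range or inverted start simply yields an empty slice.
    return list(hits[start:end + 1]), warnings
```

Applied to the example that started the thread (10 matches, `cut Last 5 15`), this sketch would keep hits 5 through 9 and report one warning instead of aborting, which matches the way 'cat' handles the same range according to the request.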


Comments?

Best,
Stefan

