[CWB] input buffer overflow in CWB

Stefan Evert schtepf at gmail.com
Fri Jul 7 08:29:23 CEST 2017


> is there a way to avoid the following error:
> 
>> input buffer overflow, can't enlarge buffer because scanner uses REJECT

Only if you reimplement the flex/bison parser without this restriction, or if we change the grammar so that it no longer uses a REJECT clause. :-)


Background info and a question:

REJECT is used only to ensure full backward compatibility if users disable CQP macros. Since the macro feature has been in use for more than 10 years – apparently without any problems – perhaps we could think about making it non-optional.  Has anybody ever had a need to disable macro expansion?  Did you even know that macros can be turned off?


On the other hand, this should only happen if your query is longer than 16 KB, and neither CQP (which uses fixed-size buffers for regular expressions and many other things) nor the backend regexp library will be happy with queries of that size. I'm not sure about PCRE's precise limits (man pcrelimits says the "compiled pattern" cannot be larger than approx. 64K data units, whose size depends on whether you have an 8-bit, 16-bit or 32-bit version of PCRE on your machine), but POSIX regexp functions often put a limit at around 4 KB and silently discarded everything after this point – which would simply lead to incomplete and wrong results in your case.


> when using long lemma lists as cumulative search strings in CWB, say
> 
> [lemma="aa|bb|cc|dd|…"]

If you're searching for a particular list of fixed strings (as in the example above) you should put them in a 1-word-per-line file and load them into a wordlist variable:

	define $words < "my_words.txt";
	[lemma = $words];

In this way, you also won't have to worry about metacharacters in the string.
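
If your list currently exists only as one long disjunction, a small script can turn it into such a file. Here is a minimal sketch in Python, assuming the alternatives are fixed strings without regexp metacharacters or nested groups. The function name, the file name and the UTF-8 encoding are placeholders, so adjust them to your corpus:

```python
from pathlib import Path
from typing import List


def disjunction_to_wordlist(pattern: str, path: str) -> List[str]:
    """Split a flat disjunction like "aa|bb|cc" and write one word per line.

    Assumes every alternative is a fixed string (no metacharacters, no
    nested groups). Empty alternatives and duplicates are dropped.
    """
    words = []
    seen = set()
    for word in pattern.split("|"):
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    # UTF-8 is an assumption; use whatever encoding your corpus uses.
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words
```

Calling `disjunction_to_wordlist("aa|bb|cc", "my_words.txt")` should produce a file that can be loaded with the define command shown above.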

This doesn't work for case-insensitive searches and wildcard patterns (e.g. "positiv.*|good.*|fantastic.*|…"), though.  You can compile a wordlist into a regexp disjunction (metacharacters won't be escaped in this case) to get around limitations of the parser, but you'll still have to deal with the internal size limits of CQP and PCRE:

	[lemma = RE($words) %c];
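
To get a rough idea of whether a wordlist is likely to run into size limits, you can estimate how long the resulting disjunction would be. The sketch below is only an approximation: it assumes the words are simply joined with `|`, and it uses the 16 KB parser figure mentioned above as a yardstick, since I can't give exact numbers for the internal limits. The names `PARSER_BUFFER_BYTES` and `disjunction_size` are made up for this example:

```python
from pathlib import Path

# The 16 KB figure mentioned above; the internal limits of CQP and PCRE
# may well differ, so treat this as a rough yardstick only.
PARSER_BUFFER_BYTES = 16 * 1024


def disjunction_size(path, encoding="utf-8"):
    """Return the byte length of "w1|w2|..." built from a 1-word-per-line file."""
    text = Path(path).read_text(encoding=encoding)
    words = [w for w in text.splitlines() if w]
    return len("|".join(words).encode(encoding))
```

If `disjunction_size("my_words.txt")` comes out well above `PARSER_BUFFER_BYTES`, it is probably worth splitting the list into several smaller wordlists.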

Best,
Stefan


