[CWB] [ cwb-Bugs-1549254 ] CQP crashes on long kwic output lines

SourceForge.net noreply at sourceforge.net
Thu Nov 3 14:42:39 CET 2011


Bugs item #1549254, was opened at 2006-08-30 14:51
Message generated for change (Comment added) made by schtepf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=1549254&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQP interface
Group: TODO-3.5
>Status: Closed
>Resolution: Fixed
Priority: 7
Private: No
Submitted By: Stefan Evert (schtepf)
Assigned to: Stefan Evert (schtepf)
Summary: CQP crashes on long kwic output lines

Initial Comment:
When kwic output lines (generated by the "cat" command
in CQP) get too long, CQP crashes with a segmentation
fault.  This typically happens when (i) many positional
and/or structural attributes with long values are
printed, (ii) context is set to sentence and the corpus
contains very long sentences (often due to errors in
the markup), or (iii) context is set to large text
regions such as paragraphs or entire documents (or
matches are expanded to such regions).

The reason for the crash is a simple buffer overflow,
since the kwic formatting routines (in
<cqp/concordance.c>) use a fixed buffer for compiling
the output lines.  The size of this buffer is hardcoded
in <cqp/concordance.c> (MAXKWICLINELEN constant) and is
currently set to 32768 characters.

----------------------------------------------------------------------

>Comment By: Stefan Evert (schtepf)
Date: 2011-11-03 14:42

Message:
Fixed in trunk/ and branches/3.0/ with SVN revision 278.

Buffer overflow is now correctly detected and the kwic line is truncated
as necessary.  This means that long lines will silently be cut off and may
show some weird effects around the edges.  Still, this is definitely an
improvement over the segfault, which causes the CWB::CQP interface to hang.

The default size of the internal buffer has been increased to 65535
characters, which seems to be enough for BNCweb to display all sentences in
the BNC correctly.
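
As a rough illustration of the behaviour described above (detect the
overflow, truncate, keep the string valid), here is a minimal sketch in C.
It is not the code from <cqp/concordance.c>; the names append_bounded and
LINE_BUF_SIZE are hypothetical, and the tiny buffer size is chosen only to
make truncation easy to see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the fixed kwic line buffer size. */
#define LINE_BUF_SIZE 16

/* Append src to the NUL-terminated string in buf, whose total capacity is
 * cap bytes (including the terminator).  Copies as much of src as fits,
 * always leaves buf NUL-terminated, and returns 1 if src had to be
 * truncated, 0 otherwise. */
static int append_bounded(char *buf, size_t cap, const char *src)
{
    assert(buf != NULL && src != NULL && cap > 0);

    size_t used = strlen(buf);        /* buf must already be terminated */
    assert(used < cap);

    size_t room = cap - 1 - used;     /* free bytes before the terminator */
    size_t want = strlen(src);
    size_t n = want < room ? want : room;

    memcpy(buf + used, src, n);
    buf[used + n] = '\0';             /* terminate even when truncating */
    return want > room;
}
```

A caller would start from an empty string (buf[0] = '\0') and append each
piece of the output line in turn; the return value makes it possible to
flag a truncated line rather than cutting it off silently.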


----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2006-08-30 14:57

Message:

Actually, the formatting code already checks for buffer
overflow, simply cutting off the output after
MAXKWICLINELEN bytes.  I believe that it just forgets to
terminate the truncated string with a NUL character so that
the C standard library crashes when it tries to print the
string.  Needs some more thorough investigation, though.
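
The suspected failure mode (truncation without a terminating NUL) can be
reproduced in isolation with the standard strncpy, which writes no
terminator when the source fills the destination.  The sketch below is not
the CQP code and only assumes that the formatter behaves similarly;
is_terminated and copy_truncated are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a NUL byte occurs within the first
 * cap bytes of buf, 0 otherwise. */
static int is_terminated(const char *buf, size_t cap)
{
    assert(buf != NULL);
    return memchr(buf, '\0', cap) != NULL;
}

/* Copy src into dst (capacity cap > 0), cutting it off if it is too long.
 * strncpy alone would leave dst unterminated on truncation, so the last
 * byte is set explicitly. */
static void copy_truncated(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    strncpy(dst, src, cap - 1);
    dst[cap - 1] = '\0';
}
```

Calling strncpy(raw, "abcdefg", 4) on a 4-byte buffer leaves it without a
terminator, so passing it to a printing routine reads past the end of the
buffer; copy_truncated on the same input yields the valid string "abc".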

While it would be relatively easy to patch up the problem
for now (make sure that output string is always
NUL-terminated, increase buffer size to handle all commonly
encountered situations), a fundamental redesign of the kwic
formatting code is direly needed and I would prefer to keep
this bug on hold till then.



----------------------------------------------------------------------
