[CWB] CQP bug report?

Stefan Evert stefan.evert at uos.de
Thu Feb 26 17:51:08 CET 2009


Dear all,

just to let you know what happened: Eros and I were able to solve the  
problem off-list (thanks for the detailed bug report and excellent  
support!).

It turned out that my initial guess was correct. The CWB uses
fixed-size internal buffers in some places, including string
normalisation
for %c and %d flags. This imposes a hard limit on the length of  
annotation strings (both in positional and structural attributes),  
which is currently 4095 characters (MAX_LINE_LENGTH - 1). Eros'  
version of ITWAC contains some tokens that are about 7000 characters  
long (apparently something that Italians say when they go for their  
annual medical checkup ;-), causing the observed memory corruption.

I have now added some safety checks to cwb-encode in the development
version of the CWB (which should have been there in the first place,
of course). If an input file contains oversized strings, cwb-encode
will print a warning and truncate the string to the first 4094
characters + '$' indicating the truncation.
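
As a rough illustration of the truncation rule described above, here is
a minimal C sketch. It is not the actual cwb-encode code: the function
name copy_annotation is hypothetical, MAX_LINE_LENGTH is assumed to be
4096 (inferred from 4095 = MAX_LINE_LENGTH - 1), and lengths are counted
in bytes rather than characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed value: the post only states that MAX_LINE_LENGTH - 1 == 4095. */
#define MAX_LINE_LENGTH 4096

/*
 * Hypothetical helper, for illustration only. Copies src into a buffer of
 * MAX_LINE_LENGTH bytes. Input longer than MAX_LINE_LENGTH - 1 bytes is cut
 * to its first MAX_LINE_LENGTH - 2 bytes, followed by '$' as a marker.
 * Returns 1 if the string was truncated, 0 otherwise.
 */
int copy_annotation(char dst[MAX_LINE_LENGTH], const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t len = strlen(src);
    if (len <= MAX_LINE_LENGTH - 1) {
        memcpy(dst, src, len + 1);          /* fits, including the NUL */
        return 0;
    }

    memcpy(dst, src, MAX_LINE_LENGTH - 2);  /* first 4094 bytes */
    dst[MAX_LINE_LENGTH - 2] = '$';         /* truncation marker */
    dst[MAX_LINE_LENGTH - 1] = '\0';
    return 1;
}
```

A caller could print a warning whenever the return value is 1. The point
of the length check is that nothing is ever written past the end of the
fixed-size buffer, which is what an unchecked copy of a 7000-character
token would do.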

You can always download the very latest code from the SourceForge SVN,  
following the instructions at cwb.sf.net, but an official release of  
the code is "imminent"(1).

Best to everyone,
Stefan


(1) = as it has been for the last 5 years or so ... ;-}


