[CWB] CQP bug report?
Stefan Evert
stefan.evert at uos.de
Thu Feb 26 17:51:08 CET 2009
Dear all,
just to let you know what happened: Eros and I were able to solve the
problem off-list (thanks for the detailed bug report and excellent
support!).
It turned out that my initial guess was correct. The CWB uses
fixed-size internal buffers in some places, including string normalisation
for %c and %d flags. This imposes a hard limit on the length of
annotation strings (both in positional and structural attributes),
which is currently 4095 characters (MAX_LINE_LENGTH - 1). Eros'
version of ITWAC contains some tokens that are about 7000 characters
long (apparently something that Italians say when they go for their
annual medical checkup ;-), causing the observed memory corruption.
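For anyone who wants to check their own input before encoding, here is a minimal sketch of a scan for oversized annotation strings. It is not part of the CWB. It assumes one token per line with tab-separated positional attributes, skips lines starting with "<" as XML tags (so structural attribute values are not checked), and counts characters as described above; the name find_oversized is made up for illustration, and MAX_LINE_LENGTH = 4096 is simply inferred from the limit of 4095.

```python
MAX_LINE_LENGTH = 4096                # inferred: the stated limit is MAX_LINE_LENGTH - 1
MAX_ANNOTATION = MAX_LINE_LENGTH - 1  # 4095 characters

def find_oversized(lines, limit=MAX_ANNOTATION):
    """Return (line_number, column, length) for every field longer than `limit`."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("<"):
            # Assumption: XML tag line; not inspected in this sketch.
            continue
        for col, field in enumerate(line.split("\t"), start=1):
            if len(field) > limit:
                hits.append((line_no, col, len(field)))
    return hits

# Small in-memory example instead of a real corpus file.
sample = ["<s>\n", "short\tNN\n", "x" * 7000 + "\tNN\n"]
print(find_oversized(sample))
```

Any line reported here would hit the buffer limit in the affected CWB versions. With a multi-byte encoding the byte length may be the relevant measure rather than the character count, so treat the result as a rough guide.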
I have added some safety checks to cwb-encode in the development
version of the CWB now (which should have been there in the first
place, of course). If an input file contains oversized strings,
cwb-encode will print a warning and truncate each one to its first 4094
characters followed by '$' to mark the truncation.
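The truncation rule can be illustrated with a short sketch. This is not the actual cwb-encode code (which is written in C), only a Python rendering of the behaviour described above; the function name truncate_annotation and its return value are assumptions made for the example.

```python
MAX_LINE_LENGTH = 4096                # inferred: the stated limit is MAX_LINE_LENGTH - 1
MAX_ANNOTATION = MAX_LINE_LENGTH - 1  # 4095 characters

def truncate_annotation(value, limit=MAX_ANNOTATION, marker="$"):
    """Return (string, was_truncated); oversized strings are cut and end in `marker`."""
    if len(value) <= limit:
        return value, False
    # Keep the first limit - 1 characters (4094 by default) and append the marker,
    # so the result is exactly `limit` characters long.
    return value[:limit - len(marker)] + marker, True

long_token = "a" * 7000
result, truncated = truncate_annotation(long_token)
if truncated:
    print("Warning: oversized string truncated to", len(result), "characters")
```

A string of exactly 4095 characters passes through unchanged; anything longer comes back as 4094 original characters plus '$'.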
You can always download the very latest code from the SourceForge SVN,
following the instructions at cwb.sf.net, but an official release of
the code is "imminent"(1).
Best to everyone,
Stefan
(1) = as it has been for the last 5 years or so ... ;-}