[CWB] [ cwb-Bugs-1489514 ] WebCqp::Query fails on long sentences

SourceForge.net noreply at sourceforge.net
Sun Feb 5 20:16:30 CET 2012


Bugs item #1489514, was opened at 2006-05-16 05:35
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=1489514&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Engine
Group: TODO-3.5
>Status: Closed
>Resolution: Fixed
Priority: 3
Private: No
Submitted By: Lars Nygaard (larsnyg)
Assigned to: Stefan Evert (schtepf)
Summary: WebCqp::Query fails on long sentences

Initial Comment:
The combination of long sentences and many positional
attributes seems to cause WebCqp::Query to fail: the
process hangs at 99% CPU usage and nothing happens.

In my particular case, it was 16 attributes (a detailed
morphological and syntactic analysis of Norwegian) and
some sentences of more than 100 words. If necessary,
I can provide some exact numbers here.

With 15 attributes, the query works, but I suspect
there will be problems with queries returning even
longer sentences (and there are quite a few, since the
corpus contains literary text, and some authors produce
sentences of many hundreds of words).


----------------------------------------------------------------------

>Comment By: Andrew Hardie (andrewhardie)
Date: 2012-02-05 11:16

Message:
I'm closing this bug because it is an effect of the CQP buffer overflow
which has been fixed in 3.4.3 (in the sense that it will no longer crash,
although the desired data will still not be retrieved).

Underlying issues with how CQP output works remain, of course....

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2006-08-30 05:37

Message:
Logged In: YES 
user_id=545257

CQP crashes only because you try to print too many
attributes at the same time through WebCqp::Query.  If you
use CQP.pm and "dump" the query matches, you can then access
all necessary attributes directly through the CL.pm module,
without problems, or use the "repeated cat" solution.

Multiple "cat"s shouldn't be much slower on the CQP side,
but Perl may become substantially slower because of the
overhead (depending on how you use that information in Perl).
In general, direct access with CL.pm is the fastest solution.

I will add a bug report for the general kwic output problem.
Should such known bugs also be documented on the CWB Web
site, in the official manuals, or somewhere else?

----------------------------------------------------------------------

Comment By: Lars Nygaard (larsnyg)
Date: 2006-08-30 04:54

Message:
Logged In: YES 
user_id=1035773

If I understand correctly, the proposed workaround is to use
multiple "cat"s (given that I need sentence context, and that
direct access through CL does not help because it is CQP itself
that crashes). I guess this will come at a severe speed penalty,
and hence will not be appropriate in my case.

I ended up using another workaround: packed attribute
tables. Since some morphological categories are only used
for certain parts-of-speech, I introduced attributes like
"case_tense" and "person_mood". I can then create queries
like "case_tense='past'" or "case_tense='nominative'". This
reduced the number of attributes sufficiently for my corpus,
though I also had to delete a couple of sentences (one text
in the corpus was actually 1000 words long).
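
The packing idea above can be sketched roughly as follows. This is only
an illustration in Python, not the script used for the corpus: the
PACKED table, the pack_token helper, the "_" filler value and the sample
tokens are all assumptions made for the example. The point is that two
categories which never co-occur on the same part of speech can share one
column.

```python
# Hypothetical layout: which original category each part of speech
# contributes to a packed column. Names are illustrative only.
PACKED = {
    "case_tense": {"noun": "case", "verb": "tense"},
    "person_mood": {"pronoun": "person", "verb": "mood"},
}


def pack_token(token):
    """Return a copy of `token` with absorbed categories folded into packed columns."""
    pos = token["pos"]
    # Every original category that some packed column replaces.
    absorbed = {cat for by_pos in PACKED.values() for cat in by_pos.values()}
    packed = {k: v for k, v in token.items() if k not in absorbed}
    for column, by_pos in PACKED.items():
        source = by_pos.get(pos)
        # "_" marks "not applicable for this part of speech" (an assumed convention).
        packed[column] = token.get(source, "_") if source else "_"
    return packed


noun = pack_token({"word": "hest", "pos": "noun", "case": "nominative"})
verb = pack_token(
    {"word": "gikk", "pos": "verb", "tense": "past", "mood": "indicative"}
)
```

With this layout, four original columns (case, tense, person, mood)
become two, and a query on case_tense="past" can only match verbs
because no noun carries that value.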

Anyway, if it can't be fixed, I think this bug should be
documented. It sure caused me a lot of pain to figure out
what was wrong.

cheers,
lars


----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2006-08-30 03:57

Message:
Logged In: YES 
user_id=545257

CQP crashes when formatted kwic output lines get too long,
which is probably what happens here: WebCqp::Query then
hangs trying to read CQP output because it doesn't realize
that the CQP backend has died (can you check whether the cqp
process is really gone and Perl takes all the CPU load?). 
This is a very fundamental problem of the kwic output and
will need a major rewrite of CQP to be resolved.
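
The hang described above (a front end waiting forever for output from a
backend that has already died) can be avoided by treating end-of-file on
the pipe as an error. The following is a minimal sketch in Python, not
the WebCqp::Query code: END_MARKER, BackendDied, read_until_marker and
the two stand-in child processes are all hypothetical, and the real
module may use a different protocol.

```python
import subprocess
import sys

END_MARKER = "END-OF-OUTPUT"  # hypothetical end-of-result marker


class BackendDied(RuntimeError):
    """Raised when the child closes its output before sending the marker."""


def read_until_marker(proc, marker=END_MARKER):
    """Collect output lines up to `marker`; fail loudly if the child goes away."""
    lines = []
    while True:
        line = proc.stdout.readline()
        if line == "":
            # EOF: the pipe was closed, which normally means the child exited.
            status = proc.wait()
            raise BackendDied(f"backend exited with status {status}")
        line = line.rstrip("\n")
        if line == marker:
            return lines
        lines.append(line)


def spawn(script):
    """Start a stand-in backend (a small Python one-liner) for demonstration."""
    return subprocess.Popen(
        [sys.executable, "-c", script], stdout=subprocess.PIPE, text=True
    )


healthy = spawn("print('hit 1'); print('hit 2'); print('END-OF-OUTPUT')")
result = read_until_marker(healthy)
healthy.stdout.close()
healthy.wait()

crashed = spawn("import sys; print('hit 1'); sys.stdout.flush(); sys.exit(3)")
try:
    read_until_marker(crashed)
    died = False
except BackendDied:
    died = True
crashed.stdout.close()
```

A reader written this way reports the dead backend instead of spinning,
although it does nothing about the underlying crash.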

For the time being, there are better ways of getting hold of
attribute values in a CGI script; e.g. use "tabulate" if
sentence context isn't mandatory, multiple "cat"s with one
attribute each, or direct access through the CL module. 
WebCqp::Query isn't reliable enough for this kind of
heavy-duty processing, and was never meant to be (a rewrite
of the Perl module could improve the situation but would
probably also lead to considerable speed losses).
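
As a rough sketch of the "multiple cats" idea, the helper below builds
one short command batch per attribute, so that no single kwic line has to
carry all attributes at once. It is written in Python for illustration
only; cat_batches is a hypothetical helper, and the command strings
("show +att;", "show -att;", "cat Last;") reflect my understanding of
CQP syntax and should be checked against the CQP manual before use.

```python
def cat_batches(attributes, batch_size=1, query_name="Last"):
    """Return CQP command strings that print `query_name` once per attribute batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commands = []
    for start in range(0, len(attributes), batch_size):
        batch = attributes[start:start + batch_size]
        # Switch the batch's attributes on, print the matches, switch them off again.
        commands.append(" ".join(f"show +{att};" for att in batch))
        commands.append(f"cat {query_name};")
        commands.append(" ".join(f"show -{att};" for att in batch))
    return commands


cmds = cat_batches(["pos", "lemma", "case_tense"])
```

The caller would then send each command to CQP in turn and merge the
per-attribute outputs by match position; a larger batch_size trades
fewer round trips for longer output lines.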

----------------------------------------------------------------------
