[CWB] Too many tokens in an attribute?

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Apr 30 03:29:15 CEST 2017


I have done a spot of digging in the scary code and found the answer: there is a setting called “HardBoundary” which sets a limit on the length in tokens of matches retrieved. By default, it is set to 100.

You can change it to a different value thus:

set HardBoundary 200;
set HardBoundary 1000000;
(etc.)
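
Applied to the query from the original report, that would look something like the sketch below. The result name A and the value 200 are arbitrary choices for illustration, and the sketch assumes the 95-token message really is being cut off by this limit:

```
# raise the limit for this session, then re-run the reported query
set HardBoundary 200;
A = <msg>[]*</msg> :: match.msg_msg_id = "162578";
size A;
```

If the limit is the culprit, size A should now report a match where it previously reported none.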

I don’t know why HardBoundary is not mentioned in the tutorial, but I suspect it exists because a query could otherwise take an awfully long time to execute: if you use []* over a lot of tokens, each one of them has to be evaluated, which involves lots and lots of messing around with memory for tables of corpus positions.

“HardBoundary” is not in the tutorial, nor is it listed when you run “set;”. However, another way of accessing the same setting (at startup) is noted in cqp -h:

    […]
    -b num       set hard boundary for kleene star to <num> tokens
    […]

and in “man cqp” both -b and HardBoundary get a mention:

       -b num
           Sets a hard boundary for the kleene star in token-sequence regular expressions (not string-matching
           regular expressions). When this option is specified, matches will be made across no more than num
           tokens.  This is basically the same as adding a "within num" clause to the CQP query; an explicit
           "within" clause overrides the hard boundary, if specified.

           Equivalent to the interactive command "set HardBoundary num".
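
Going by that excerpt, the two routes below should have the same effect. This is only a sketch: 500 is an arbitrary value, and the equivalence is taken from the man page rather than tested here.

```
# at startup, from the shell
cqp -b 500

# or, inside a running CQP session
set HardBoundary 500;
```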


The manual probably ought to mention that it is set to 100 by default. Moreover, the current wording “When this option is specified” is misleading, as it suggests the boundary can be left unspecified (it can’t: it always has some value). Also, it applies not only to * but to repetition generally.

The default of 100 seems a bit stingy; I will provisionally change it to 500.

best

Andrew.

From: cwb-bounces at sslmit.unibo.it On Behalf Of Hardie, Andrew
Sent: 30 April 2017 01:26
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Too many tokens in an attribute?

This is … interesting behaviour.

I see you’re using 3.0.0 but I’ve reproduced it with 3.4.10 and a 1.2 billion token corpus:

EEBOV3> L = <text>[]*</text>;
EEBOV3> size L;
761
EEBOV3> LL = <text>[];
EEBOV3> size LL;
44421
EEBOV3> LLL = <text>[] expand to text;
EEBOV3> size LLL;
44421

I am not really sure what’s going on here. It probably involves the parts of the CQP code that I’m a bit scared of…
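
The session above does suggest a way round the problem that avoids repetition altogether: match a single token at the start of each region and expand the match to the whole region. A sketch for the msg regions from the original report follows; the name M is an arbitrary choice, and it assumes that expand behaves for msg as it did for text above.

```
# one match per msg region, regardless of how many tokens it contains
M = <msg>[] expand to msg;
size M;
```

If that assumption holds, size M should equal the number of msg regions in the corpus rather than only those short enough to fit within the limit.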

Andrew.

From: cwb-bounces at sslmit.unibo.it On Behalf Of Simone Ueberwasser
Sent: 29 April 2017 07:50
To: cwb at sslmit.unibo.it
Subject: [CWB] Too many tokens in an attribute?

Hi everybody,
I am trying to create a CQP corpus with XML attributes from 618 vrt files. They look as follows:

<text encrypted_msg="0" contains_fra="true" content_msg="1838" user_msg="1873" no_consent_msg="0" consent_speakers="2" lang_100_and_more="fra" speakers="2" empty_msg="0" media_msg="35" system_msg="0" total_msg="1873">

<msg msg_id="162577">
token1 pos1
token2 pos2
</msg>
</text>


An example with two messages is available here: http://www.ueberwasser.eu/chat105_original.vrt

In this example, the first message contains 3 tokens. If I run the following query, the message is found:
<msg>[]*</msg>:: match.msg_msg_id = "162577"

The second message is 95 tokens long. The same query shows no results:
<msg>[]*</msg>:: match.msg_msg_id = "162578"

If I remove any 5 tokens from this message, the query works for this message, too. Is this normal behaviour? Is there a limit to the number of tokens within an attribute? I could not find any information in the documentation.

Many thanks for any help
Simone


*********************************
Setup:
Xubuntu 16.04
cwb-3.0.0-linux-x86_64
CWB Perl-CWB-3.0
CWB-CL Perl-CWB-CL-3.0
CWB-Web Perl-CWB-Web-3.0
CWB-CQI Perl-CWB-CQI-3.0

But I had the same problem with CWB 3.0 and Perl scripts 2.2 on a Mac

I create the corpus with:
sudo -H cwb-encode -c utf8 -x -s -B -d PathToData -f /pathtofile.vrt -R PathToRegistry -P pos -S text:0+contains_fra+no_consent_msg+content_msg+empty_msg+total_msg+speakers+media_msg+system_msg+user_msg+encrypted_msg+consent_speakers+lang_100_and_more+demographics+lang_less_than_100+contains_gsw+contains_eng+contains_spa+contains_deu+contains_ita+contains_sla+contains_roh -S msg:0+msg_id

===========================================================
http://www.ueberwasser.eu/
===========================================================