[CWB] Too many tokens in an attribute?

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Apr 30 02:25:34 CEST 2017


This is … interesting behaviour.

I see you’re using 3.0.0 but I’ve reproduced it with 3.4.10 and a 1.2 billion token corpus:

EEBOV3> L = <text>[]*</text>;
EEBOV3> size L;
761
EEBOV3> LL = <text>[];
EEBOV3> size LL;
44421
EEBOV3> LLL = <text>[] expand to text;
EEBOV3> size LLL;
44421

I am not really sure what’s going on here. It probably involves the parts of the CQP code that I’m a bit scared of…

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Simone Ueberwasser
Sent: 29 April 2017 07:50
To: cwb at sslmit.unibo.it
Subject: [CWB] Too many tokens in an attribute?

Hi everybody,
I am trying to create a CQP corpus with XML attributes from 618 vrt files. They look as follows:

<text encrypted_msg="0" contains_fra="true" content_msg="1838" user_msg="1873" no_consent_msg="0" consent_speakers="2" lang_100_and_more="fra" speakers="2" empty_msg="0" media_msg="35" system_msg="0" total_msg="1873">

<msg msg_id="162577">
token1 pos1
token2 pos2
</msg>
</text>


An example with two messages is available here: http://www.ueberwasser.eu/chat105_original.vrt

In this example, the first message contains 3 tokens. If I run the following query, the message is found:
<msg>[]*</msg>:: match.msg_msg_id = "162577"

The second message is 95 tokens long. The same query shows no results:
<msg>[]*</msg>:: match.msg_msg_id = "162578"

If I remove any 5 tokens from this message, the query finds this message, too. Is this normal behaviour? Is there a limit on the number of tokens within an attribute? I could not find any information in the documentation.
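
To narrow down where the cutoff lies, test files with messages of different lengths could be generated automatically. The following is only a minimal sketch in Python: the function name make_vrt and the reduced <text> attributes are made up for illustration, and it assumes the usual tab-separated token/pos columns that cwb-encode reads.

```python
def make_vrt(msg_lengths, first_id=162577):
    """Build the content of a small .vrt file (hypothetical helper).

    msg_lengths: number of tokens for each <msg> region, in order.
    first_id:    msg_id of the first message; later ones count upwards.
    """
    # A single <text> region with a reduced attribute set (assumption:
    # the full attribute list is not needed to reproduce the problem).
    lines = ['<text speakers="2">']
    for offset, n_tokens in enumerate(msg_lengths):
        lines.append(f'<msg msg_id="{first_id + offset}">')
        for i in range(1, n_tokens + 1):
            # One token per line: word form and pos, separated by a tab.
            lines.append(f"token{i}\tpos{i}")
        lines.append("</msg>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"
```

Writing make_vrt([3, n]) to a file for a range of n (for example 90 to 100), encoding each file, and running the msg_id query on the second message should show at which length the match disappears.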

Many thanks for any help
Simone


*********************************
Setup:
Xubuntu 16.04
cwb-3.0.0-linux-x86_64
CWB Perl-CWB-3.0
CWB-CL Perl-CWB-CL-3.0
CWB-Web Perl-CWB-Web-3.0
CWB-CQI Perl-CWB-CQI-3.0

But I had the same problem with CWB 3.0 and Perl scripts 2.2 on a Mac.

I create the corpus with:
sudo -H cwb-encode -c utf8 -x -s -B -d PathToData -f /pathtofile.vrt -R PathToRegistry -P pos -S text:0+contains_fra+no_consent_msg+content_msg+empty_msg+total_msg+speakers+media_msg+system_msg+user_msg+encrypted_msg+consent_speakers+lang_100_and_more+demographics+lang_less_than_100+contains_gsw+contains_eng+contains_spa+contains_deu+contains_ita+contains_sla+contains_roh -S msg:0+msg_id

===========================================================
www.ueberwasser.eu
===========================================================