[CWB] Too many tokens in an attribute?

Simone Ueberwasser simone at ueberwasser.net
Sat Apr 29 08:49:55 CEST 2017


Hi everybody,
I am trying to create a CQP corpus with XML attributes from 618 vrt files. They look as follows:

<text encrypted_msg="0" contains_fra="true" content_msg="1838" user_msg="1873" no_consent_msg="0" consent_speakers="2" lang_100_and_more="fra" speakers="2" empty_msg="0" media_msg="35" system_msg="0" total_msg="1873">

<msg msg_id="162577">
token1 pos1
token2 pos2
</msg>
</text>


An example with two messages is available here: www.ueberwasser.eu/chat105_original.vrt

In this example, the first message contains 3 tokens. If I run the following query, the message is found:
<msg>[]*</msg>:: match.msg_msg_id = "162577"

The second message is 95 tokens long. The same query shows no results:
<msg>[]*</msg>:: match.msg_msg_id = "162578"

If I remove any 5 tokens from this message, the query finds this message, too. Is this normal behaviour? Is there a limit on the number of tokens within a region of a structural attribute? I could not find any information about this in the documentation.
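
To double-check the message lengths outside of CQP, something like the following could be used. It is only a minimal sketch: the function name count_tokens_per_msg is made up for illustration, and it assumes one token per line and that every <msg> start tag carries a msg_id attribute, as in the example above.

```python
import re
import sys

# Start tag of a message region, e.g. <msg msg_id="162577">
MSG_OPEN = re.compile(r'<msg\s[^>]*\bmsg_id="([^"]*)"')
# Any other XML-like tag line; a token such as "<3" does not match
TAG = re.compile(r'^</?[A-Za-z_][\w.-]*(\s[^>]*)?>$')


def count_tokens_per_msg(lines):
    """Return {msg_id: number of token lines} for each <msg> region."""
    counts = {}
    current = None  # msg_id of the region we are inside, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        opened = MSG_OPEN.match(line)
        if opened:
            current = opened.group(1)
            counts[current] = 0
        elif line.startswith("</msg>"):
            current = None
        elif TAG.match(line):
            continue  # other structural tags such as <text ...>
        elif current is not None:
            counts[current] += 1  # everything else counts as a token
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_msg_tokens.py chat105_original.vrt
    with open(sys.argv[1], encoding="utf-8") as fh:
        for msg_id, n in count_tokens_per_msg(fh).items():
            print(msg_id, n)
```

Run on the example file, this should list one line per message with its token count, which makes it easy to see at which length the query stops matching.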

Many thanks for any help
Simone 


*********************************
Setup: 
Xubuntu 16.04
cwb-3.0.0-linux-x86_64
CWB Perl-CWB-3.0
CWB-CL Perl-CWB-CL-3.0
CWB-Web Perl-CWB-Web-3.0
CWB-CQI Perl-CWB-CQI-3.0

I had the same problem with CWB 3.0 and Perl scripts 2.2 on a Mac, though.

I create the corpus with:
sudo -H cwb-encode -c utf8 -x -s -B -d PathToData -f /pathtofile.vrt -R PathToRegistry -P pos -S text:0+contains_fra+no_consent_msg+content_msg+empty_msg+total_msg+speakers+media_msg+system_msg+user_msg+encrypted_msg+consent_speakers+lang_100_and_more+demographics+lang_less_than_100+contains_gsw+contains_eng+contains_spa+contains_deu+contains_ita+contains_sla+contains_roh -S msg:0+msg_id
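
Since the -S declarations are long and have to cover every attribute that occurs in any of the files, they could also be generated from the data. The following is a rough sketch under that assumption; the helper name s_attribute_spec is hypothetical, and it only builds the string in the same name:0+attr1+attr2 form used in the command above, without calling cwb-encode itself.

```python
import re

# An attribute inside a start tag, e.g. msg_id="162577"
ATTR = re.compile(r'([A-Za-z_][\w.-]*)="[^"]*"')


def s_attribute_spec(element, lines):
    """Build 'element:0+attr1+attr2...' from the start tags of <element>."""
    names = []
    prefix = "<" + element
    for raw in lines:
        line = raw.strip()
        # Require a delimiter after the name so <text does not match <textual
        next_char = line[len(prefix):len(prefix) + 1]
        if line.startswith(prefix) and next_char in (" ", "\t", ">"):
            for name in ATTR.findall(line):
                if name not in names:
                    names.append(name)  # keep first-seen order
    return element + ":0" + "".join("+" + n for n in names)


# Example with tags like the ones in the vrt excerpt
example = [
    '<text encrypted_msg="0" contains_fra="true" speakers="2">',
    '<msg msg_id="162577">',
    'token1\tpos1',
    '</msg>',
    '</text>',
]
print("-S", s_attribute_spec("text", example))
print("-S", s_attribute_spec("msg", example))
```

To cover all 618 files, the lines of every file would have to be passed in (for instance by chaining the open file objects), so that attributes occurring only in some files are picked up as well.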

===========================================================
www.ueberwasser.eu
===========================================================