[CWB] Maximum corpus size exceeded

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Mar 30 11:22:29 CEST 2017


Stefan and I are anxious to get underway with v4 too. There always seems to be just one more thing to fix with the current version before we can move on, though…

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 30 March 2017 10:18
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Maximum corpus size exceeded

Thanks for the info, Andrew. I just needed to make sure I wasn't doing something wrong on my end. Can't wait for v4, by the way!

And thanks for the tip, Vladimir. NoSkE certainly looks nice, but I'm pretty attached to CWB :-)

Cheers,
Scott

On Thu, Mar 30, 2017 at 6:11 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
And our Ziggurat project is designed to address, among other things, precisely this limitation.

Read all about it: http://cwb.sourceforge.net/cwb4.php

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Vladimír Benko
Sent: 30 March 2017 09:49
To: ssadowsky at gmail.com
Cc: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Maximum corpus size exceeded

Dear Scott,

Yes, this is a documented limitation of the CWB software.  One of the options for larger corpora is a system called NoSketch Engine, which is an open-source subset of the commercial Sketch Engine.  The largest corpus we have in our installation of NoSkE is the Russian 13.7 billion Araneum Russicum Maximum.  You may want to try how the system feels here:

http://unesco.uniba.sk/guest/index.html

The software itself can be downloaded here:

https://nlp.fi.muni.cz/trac/noske

Best,

Vlado B, 10:45
Hi all,

I just got this warning for the first time:

WARNING: Maximal corpus size has been exceeded.
         Input truncated to the first 2147483647 tokens (file /home/homebox/Corpora/source-files//input.vrt, line #3161375683).
Warning: missing </s> tag inserted at end of input.

Is there any way around this, by chance? That's 2^31 - 1, the largest value of a signed 32-bit integer, but I'm on a 64-bit system with ext4 filesystems, so I assume the issue is CWB related.

Cheers!
Scott
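
For reference, the figure in the warning above, 2147483647, equals 2^31 - 1, the largest value a signed 32-bit integer can hold. The Python sketch below shows one rough way to estimate, before encoding, whether a verticalized input file would hit that ceiling. It is a simplification resting on stated assumptions: one token per line, lines starting with "<" are structural tags, and blank lines are ignored. The names MAX_TOKENS, count_vrt_tokens and exceeds_limit are illustrative only and are not part of CWB.

```python
import io

# Largest value of a signed 32-bit integer; this matches the token count
# reported in the warning message (2147483647).
MAX_TOKENS = 2**31 - 1


def count_vrt_tokens(lines):
    """Count token lines in verticalized text.

    Assumptions (a simplification, not CWB's actual parser):
    one token per line, lines starting with '<' are structural tags,
    and blank lines are ignored.
    """
    count = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("<"):
            count += 1
    return count


def exceeds_limit(token_count, limit=MAX_TOKENS):
    """Return True if the token count is above the given limit."""
    return token_count > limit


# Tiny in-memory sample standing in for a real .vrt file.
sample = io.StringIO("<s>\nThe\tDT\ncat\tNN\n</s>\n")
n = count_vrt_tokens(sample)
print(n, exceeds_limit(n))
```

For a real corpus, the same function could be given an open file handle instead of the in-memory sample, since it only iterates over lines and so never loads the whole file into memory.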



_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb



--
Vladimír Benko

Université Comenius de Bratislava
Chaire UNESCO de communication
plurilingue et multiculturelle

Šafárikovo námestie 6, SK-81499 Bratislava

http://unesco.uniba.sk/guest/
https://www.facebook.com/araneawebcorpora/
https://vk.com/araneawebcorpora


--
Dr. Scott Sadowsky
Assistant Professor of Linguistics
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/
