[CWB] Maximum corpus size exceeded

Scott Sadowsky ssadowsky at gmail.com
Thu Mar 30 11:17:59 CEST 2017


Thanks for the info, Andrew. I just needed to make sure I wasn't doing
something wrong on my end. Can't wait for v4, by the way!

And thanks for the tip, Vladimir. NoSkE certainly looks nice, but I'm
pretty attached to CWB :-)

Cheers,
Scott

On Thu, Mar 30, 2017 at 6:11 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> And our Ziggurat project is designed to address, among other things,
> precisely this limitation.
>
>
>
> Read all about it: http://cwb.sourceforge.net/cwb4.php
>
>
>
> best
>
>
>
> Andrew.
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Vladimír Benko
> *Sent:* 30 March 2017 09:49
> *To:* ssadowsky at gmail.com
> *Cc:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] Maximum corpus size exceeded
>
>
>
> Dear Scott,
>
> Yes, this is a documented limitation of the CWB software.  One option for
> larger corpora is NoSketch Engine, an open-source subset of the commercial
> Sketch Engine.  The largest corpus in our NoSkE installation is the
> Russian 13.7 billion Araneum Russicum Maximum.  You can try out the
> system here:
>
> http://unesco.uniba.sk/guest/index.html
>
> The software itself can be downloaded here:
>
> https://nlp.fi.muni.cz/trac/noske
>
> Best,
>
> Vlado B, 10:45
>
> Hi all,
>
>
>
> I just got this warning for the first time:
>
>
>
> WARNING: Maximal corpus size has been exceeded.
>
>          Input truncated to the first 2147483647 tokens (file
> /home/homebox/Corpora/source-files//input.vrt, line #3161375683).
>
> Warning: missing </s> tag inserted at end of input.
>
>
>
> Is there any way around this, by chance? That's 2^31 - 1, the largest
> signed 32-bit integer, but I'm on a 64-bit system with ext4 filesystems,
> so I assume the issue is CWB-related.
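
For reference, the figure in the warning can be checked directly: 2147483647
is 2^31 - 1, the largest signed 32-bit integer. That suggests, though nothing
in this thread confirms it, that token positions are held in signed 32-bit
integers. Below is a minimal Python sketch for checking whether a .vrt file
stays under that count before encoding. The function names are hypothetical,
and the counting rule (every non-empty line that does not start with "<" is
one token) is a simplifying assumption, not a description of how cwb-encode
actually parses its input.

```python
# Token count quoted in the warning; equal to the largest signed 32-bit integer.
MAX_TOKENS = 2**31 - 1  # 2147483647


def count_tokens(lines):
    """Count token lines in VRT-style input.

    Assumption: each non-empty line that does not start with '<'
    is one token; lines starting with '<' are treated as XML tags.
    """
    return sum(
        1
        for line in lines
        if line.strip() and not line.lstrip().startswith("<")
    )


def count_tokens_in_file(path):
    """Stream a file line by line so very large inputs never sit in memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_tokens(handle)


def fits_in_one_corpus(n_tokens, limit=MAX_TOKENS):
    """True if the token count does not exceed the limit from the warning."""
    return n_tokens <= limit
```

Calling count_tokens_in_file("input.vrt") and passing the result to
fits_in_one_corpus() tells you in advance whether the input would be
truncated under the assumptions above.
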
>
>
>
> Cheers!
>
> Scott
>
>
>
>
> _______________________________________________
>
> CWB mailing list
>
> CWB at sslmit.unibo.it
>
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>
>
>
>
>
> --
> Vladimír Benko
>
> Université Comenius de Bratislava
> Chaire UNESCO de communication
> plurilingue et multiculturelle
>
> Šafárikovo námestie 6, SK-81499 Bratislava
>
> http://unesco.uniba.sk/guest/
> https://www.facebook.com/araneawebcorpora/
> https://vk.com/araneawebcorpora
>
>
>


-- 
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/