[CWB] Maximum corpus size exceeded
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu Mar 30 11:11:26 CEST 2017
And our Ziggurat project is designed to address, among other things, precisely this limitation.
Read all about it: http://cwb.sourceforge.net/cwb4.php
best
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Vladimír Benko
Sent: 30 March 2017 09:49
To: ssadowsky at gmail.com
Cc: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Maximum corpus size exceeded
Dear Scott,
Yes, this is a documented limitation of the CWB software. One option for larger corpora is NoSketch Engine (NoSkE), an open-source subset of the commercial Sketch Engine. The largest corpus in our NoSkE installation is the Russian Araneum Russicum Maximum, at 13.7 billion. You can try out the system here:
http://unesco.uniba.sk/guest/index.html
The software itself can be downloaded here:
https://nlp.fi.muni.cz/trac/noske
Best,
Vlado B, 10:45
Hi all,
I just got this warning for the first time:
WARNING: Maximal corpus size has been exceeded.
Input truncated to the first 2147483647 tokens (file /home/homebox/Corpora/source-files//input.vrt, line #3161375683).
Warning: missing </s> tag inserted at end of input.
Is there any way around this, by chance? That's 2^31 - 1, the largest value of a signed 32-bit integer, but I'm on a 64-bit system with ext4 filesystems, so I assume the issue is CWB-related.
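For anyone who wants to check a file before encoding, here is a rough Python sketch that counts the tokens in a .vrt file and compares the total against the 2147483647 limit quoted in the warning. It assumes the usual one-token-per-line layout and treats every non-empty line that does not start with "<" as a token, so a literal "<" token would be miscounted. The function names are made up for illustration and are not part of CWB.

```python
import io

# Limit quoted in the warning above: 2^31 - 1 tokens.
MAX_CWB_TOKENS = 2**31 - 1


def count_vrt_tokens(lines):
    """Count token lines in an iterable of .vrt lines.

    Assumption: one token per line; lines starting with '<' are
    structural tags (e.g. <s>, </s>) and blank lines are ignored.
    """
    count = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("<"):
            count += 1
    return count


def exceeds_limit(token_count, limit=MAX_CWB_TOKENS):
    """Return True if the token count is above the limit."""
    return token_count > limit


# Tiny in-memory sample standing in for a real .vrt file.
sample = io.StringIO("<s>\nHello\tUH\n,\t,\nworld\tNN\n</s>\n")
n = count_vrt_tokens(sample)
print(n, exceeds_limit(n))
```

For a real file, pass an open file handle instead of the sample, e.g. `with open(path, encoding="utf-8") as f: count_vrt_tokens(f)`, which streams the file line by line rather than loading it into memory.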
Cheers!
Scott
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it<mailto:CWB at sslmit.unibo.it>
http://liste.sslmit.unibo.it/mailman/listinfo/cwb
--
Vladimír Benko
Université Comenius de Bratislava
Chaire UNESCO de communication
plurilingue et multiculturelle
Šafárikovo námestie 6, SK-81499 Bratislava
http://unesco.uniba.sk/guest/
https://www.facebook.com/araneawebcorpora/
https://vk.com/araneawebcorpora