[CWB] cqp and very large corpora

Nikola Tulechki nikola.tulechki at gmail.com
Fri Nov 23 17:44:45 CET 2012


Hi
Thanks for the reply, Paul.

I was asking mostly out of interest, as my task was to introduce NLP
students to the tool. I do not know what type of queries they will
be running, but I guess that they will not be optimised for performance
(students...).

I didn't think about cold vs saved queries.
I'll test the difference and mention it in the next exercise.
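
For the exercise, the cold vs. rerun comparison could be timed roughly as
follows. This is only a minimal sketch, not a CWB feature: `time_command` is
a hypothetical helper, and the command passed to it stands for however cqp
is invoked in batch mode on a given system (no particular cqp options are
assumed here). The first run is only "cold" if the operating system has not
already cached the index files.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=2):
    """Run cmd `runs` times in a row; return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the output: only the elapsed time is of interest here.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Assumption: the query command is given on the command line.
    # Without arguments, a harmless placeholder command is timed instead.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    for i, seconds in enumerate(time_command(cmd), start=1):
        print(f"run {i}: {seconds:.2f} s")
```

If the second run is much faster than the first, the difference is mostly
disk access, which is the effect Paul describes below.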

Thanks
NT


On Wed, Nov 21, 2012 at 10:27 AM, Paul Meurer <paul.meurer at uni.no> wrote:

> Hi Nikola,
>
> I understand.
> I just wanted to know if there were any available solutions.
>
> As for better hardware, I am currently running cqp on pretty high-end
> hardware (Xeon, 12 GB RAM) and, apart from investing in SSDs in order to
> speed it up some more, there is not much room for improvement…
>
>
> I am running the German DEWAC corpus (1.8G tokens, 1.6G words) on a
> different system (Korpuskel) with an architecture similar to CWB. My
> experience is that a fast disk is what really matters. (Of course, lots of
> RAM and a fast CPU are also an advantage. I wouldn't call 12GB RAM and no
> SSD/RAID exactly high end ;-) We have a RAID60 system with 22 disks, and
> that is even much faster than SSDs. The result is that cold queries (those
> that have to fetch a lot of index pages into RAM) have acceptable speed.
> When you rerun the query, only the CPU speed matters (given you
> have enough RAM), and for most queries that is orders of magnitude faster
> than the cold query. The same should be true for CWB.
>
> Multithreading could help for some queries (typically those of the
> scanning type, such as searching for two adjacent equal words) if you had
> divided your corpus into parts (or had copies of it) that were located on
> independent disks, such that the threads wouldn't have to compete for disk
> access. (That's my guess, I haven't tested it (yet).)
>
> But still, I am aware that querying 1.5G words in 3-4 minutes is already
> pretty cool, and I thank you for making this tool.
>
>
> Query response times should depend a lot on the type of your query. What
> are you typically querying?
>
> Best wishes,
> Paul
>
>
> regards
> NT
>
>
>
> On Sun, Nov 11, 2012 at 10:16 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk
> > wrote:
>
>> Better hardware?
>>
>> I know this sounds glib, but re-engineering CWB to make it multithreaded
>> or to use ancillary database indexes would be a huge undertaking. Throwing
>> better hardware at the problem will almost certainly cost you less than the
>> programmer time to rewrite large chunks of CWB from the ground up.
>>
>> best
>>
>> Andrew.
>>
>> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
>> *On Behalf Of *Nikola Tulechki
>> *Sent:* 11 November 2012 07:58
>> *To:* Open source development of the Corpus WorkBench
>> *Subject:* [CWB] cqp and very large corpora
>>
>> Hello
>>
>> I am using cqp with the *WAC corpora (1.5G words) and, while not
>> prohibitive, response times are still in the minutes range.
>>
>> Are there any ways to further speed up the tool?
>>
>> Multithreading? Indexes stored in RAM, in DB?
>>
>> Thanks
>>
>> NT
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>>
>>
>
>
>
>
> --
> Paul Meurer
> Uni Computing
> Høyteknologisenteret
> Thormøhlensgate 55
> N-5008 Bergen
>