[CWB] cqp and very large corpora

Paul Meurer paul.meurer at uni.no
Wed Nov 21 10:27:56 CET 2012


Hi Nikola,

> I understand. 
> I just wanted to know if there were any available solutions..
> 
> As for better hardware, I am currently running cqp on pretty high-end hardware (Xeon, 12 GB RAM) and, except investing in SSDs in order to speed it up some more, there is not much room for improvement…

I am running the German DEWAC corpus (1.8G tokens, 1.6G words) on a different system (Korpuskel) with an architecture similar to CWB. My experience is that a fast disk is what really matters. (Of course, lots of RAM and a fast CPU are also an advantage. I wouldn't call 12 GB RAM and no SSD/RAID exactly high end ;-) We have a RAID60 system with 22 disks, and that is much faster even than SSDs. The result is that cold queries (those that have to fetch a lot of index pages into RAM) run at acceptable speed. When you rerun the query, it is CPU speed alone that matters (given that you have enough RAM), and for most queries the rerun is orders of magnitude faster than the cold query. The same should be true for CWB.
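
To make the cold/warm distinction concrete, here is a minimal Python sketch (not Korpuskel or CWB code) that times a sequential read of a file. The function name `timed_read` and the chunk size are made up for illustration. Calling it twice on the same large index file should show the effect: the second read is usually served from the operating system's page cache, which is what makes a rerun query fast. Note that the first read is only truly cold if the file has not been touched recently.

```python
import time


def timed_read(path, chunk_size=1 << 20):
    """Read the file at `path` sequentially.

    Returns (bytes_read, elapsed_seconds). `timed_read` is a
    hypothetical helper for illustration, not part of CWB.
    """
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Comparing the elapsed time of `timed_read(path)` on a first and a second call gives a rough idea of how much of a query's time is disk access rather than computation.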

Multithreading could help for some queries (typically those of the scanning type, such as searching for two adjacent equal words) if you had divided your corpus into parts (or had copies of it) that were located on independent disks, such that the threads wouldn't have to compete for disk access. (That's my guess, I haven't tested it (yet).)
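
A rough Python sketch of what I have in mind, under several assumptions: each partition is shown as an in-memory list of tokens standing in for a corpus part on its own disk, and the names `adjacent_repeats` and `scan_partitions` are invented for this example. It is not how CWB or Korpuskel work internally. Matches that straddle a partition boundary are not found in this simple version.

```python
from concurrent.futures import ThreadPoolExecutor


def adjacent_repeats(tokens, offset=0):
    """Return corpus positions i where token i equals token i + 1."""
    return [
        offset + i
        for i in range(len(tokens) - 1)
        if tokens[i] == tokens[i + 1]
    ]


def scan_partitions(partitions):
    """Scan every partition in its own thread and merge the hits.

    Assumption: in a real setup each partition would live on an
    independent disk, so the threads do not compete for disk access.
    """
    # Compute the corpus position at which each partition starts.
    offsets = []
    start = 0
    for part in partitions:
        offsets.append(start)
        start += len(part)

    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in partition order, so hits stay sorted.
        results = list(pool.map(adjacent_repeats, partitions, offsets))
    return [pos for hits in results for pos in hits]
```

For example, `scan_partitions([["a", "a", "b"], ["b", "c", "c"]])` reports positions 0 and 4, but not the "b b" pair at positions 2 and 3, because that pair crosses the partition boundary. A real implementation would have to handle that case separately.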

> But still, I am aware that querying 1.5G words in 3-4 minutes is already pretty cool and I thank you for making this tool

Query response times should depend a lot on the type of your query. What are you typically querying?

Best wishes,
Paul

> 
> regards
> NT   
>   
> 
> 
> On Sun, Nov 11, 2012 at 10:16 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> Better hardware?
> 
> I know this sounds glib, but re-engineering CWB to make it multithreaded or to use ancillary database indexes would be a huge undertaking. Throwing better hardware at the problem will almost certainly cost you less than the programmer time to rewrite large chunks of CWB from the ground up.
> 
> best
> 
> Andrew.
> 
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Nikola Tulechki
> Sent: 11 November 2012 07:58
> To: Open source development of the Corpus WorkBench
> Subject: [CWB] cqp and very large corpora
> 
> Hello
> 
> I am using cqp with the *WAC corpora (1.5G words) and, while not prohibitive, response times are still in the minutes range.
> 
> Are there any ways to further speed up the tool?
> 
> Multithreading? Indexes stored in RAM, in DB?
> 
> Thanks
> 
> NT
> 
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> 



-- 
Paul Meurer
Uni Computing
Høyteknologisenteret
Thormøhlensgate 55
N-5008 Bergen