[CWB] encoding the entire UKWAC

Stefan Evert stefanML at collocations.de
Sun Mar 13 11:31:09 CET 2011


> I wanted to encode and index the entire UKWAC corpus but cwb-makeall fails to index files because the memory problem arises (with the 'out of memory' message popping out). Does anyone have any suggestions how to tackle this problem?

The full ukWaC is too big to be indexed as a single corpus in the CWB, because it exceeds the maximum corpus size of 2^31 (about 2.1 billion) tokens (this may change for later, "cleaned" versions of the corpus).  You either have to omit the last one or two segments (i.e. separate files in the distribution), or split it into two CWB corpora of about 1.2 billion tokens each.
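
As a rough sketch of the splitting option, the following Python snippet greedily groups segment files, in their original order, into batches that each stay below the 2^31 token limit. The function name, the segment names and the token counts are made up for illustration (they are not the real ukWaC figures); in practice you would first count the tokens in each distribution file.

```python
CWB_MAX_TOKENS = 2**31  # size limit mentioned above (about 2.1 billion)

def plan_split(segments, limit=CWB_MAX_TOKENS):
    """Group (name, n_tokens) pairs, in order, into batches strictly below limit."""
    batches, current, size = [], [], 0
    for name, n_tokens in segments:
        if n_tokens >= limit:
            raise ValueError(f"{name} alone reaches the limit")
        # Start a new batch if adding this segment would reach the limit.
        if current and size + n_tokens >= limit:
            batches.append(current)
            current, size = [], 0
        current.append(name)
        size += n_tokens
    if current:
        batches.append(current)
    return batches

# Hypothetical segment sizes, NOT the real ukWaC figures.
segments = [("seg1", 900_000_000), ("seg2", 900_000_000), ("seg3", 600_000_000)]
print(plan_split(segments))
```

Each resulting batch would then be encoded and indexed as a separate CWB corpus.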

In any case, you will only be able to index a corpus of more than 500 million tokens if you have a 64-bit version of the CWB (and, of course, a 64-bit CPU and operating system to run it on).  Appropriate binary distributions are available for Linux (cwb-3.0.0-linux-x86_64.tar.gz) and Mac OS X (cwb-3.0.0-osx-10.5-universal.tar.gz, provided that you run it on a recent computer with a Core 2 or newer CPU).

> I made a subcorpus using the function subqueries but I don't manage to search concordance lines. For instance, I made a subcorpus A = 'reason' 'to' and when I search for 'believe' (which is in the corpus) it returns 0 matches. I guess I'm doing here something wrong.  

This is no surprise if you look at it from CQP's perspective: your subcorpus A consists of all occurrences of the sequence "reason to", and nothing else. Once you activate A, CQP restricts all following queries to precisely these two-word sequences, so of course you won't find the word "believe" in this subcorpus!

What you probably meant to do is this:

	A = "reason" "to" expand to s;
	A;
	"believe";
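
To see why the expansion matters, here is a toy model in Python (not CQP code, and not how CQP is implemented): the corpus is a list of tokens, a subcorpus is a list of (start, end) corpus positions, and a query can only match inside those ranges. The example sentence, the sentence boundaries and the helper functions are invented for illustration.

```python
# Toy model: a corpus is a token list, a subcorpus is a list of
# (start, end) corpus positions, and a query only matches inside them.
tokens = "I have reason to believe him . She left .".split()
sentences = [(0, 6), (7, 9)]  # inclusive (start, end) of each <s> region

def query_seq(words, ranges):
    """Find all occurrences of the word sequence lying fully inside a range."""
    n = len(words)
    return [(i, i + n - 1)
            for start, end in ranges
            for i in range(start, end - n + 2)
            if tokens[i:i + n] == words]

def expand_to_s(matches):
    """Widen each match to the sentence region that contains it."""
    return [s for m in matches for s in sentences
            if s[0] <= m[0] and m[1] <= s[1]]

whole = [(0, len(tokens) - 1)]
A = query_seq(["reason", "to"], whole)
print(query_seq(["believe"], A))               # searches "reason to" only
print(query_seq(["believe"], expand_to_s(A)))  # searches the whole sentence
```

The first query prints an empty list, because the range (2, 3) contains only "reason to"; the second finds "believe" at position 4 once the range has been widened to the enclosing sentence.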

WARNING to everyone: traditionally, this kind of subquery was often done by implicit expansion

	A = "reason" "to";
	A^s;
	"believe";

but I just noticed that this doesn't work properly (at least in the current release of CQP).  A bug report has already been filed.

Best,
Stefan




