[CWB] Appending text to an existing corpus

Stefan Evert stefanML at collocations.de
Thu Nov 8 09:49:29 CET 2012


> I have a pretty simple question: is there any way to append text to an existing corpus?

I'm afraid not.  It was a deliberate design choice to make CWB corpora entirely static, so it is not possible to add documents or correct errors in an encoded corpus without re-indexing.  This keeps the index file formats and the entire implementation simple and straightforward, which is important for a project carried by relatively little manpower.

If you have your own Web-based front-end to CQP, you might be able to add some code that automatically runs queries across multiple corpora and merges the results (I believe Serge Sharoff used to have such functionality in his Web interface).  You'd then have the main part of your corpus data in one big CWB corpus that is re-encoded only every few months, plus a small CWB corpus collecting new pages.  The small corpus would still have to be re-encoded whenever you want to add new data, but this shouldn't take much time.  I know it's rather inconvenient, and it will be much less efficient than doing things directly in CQP on a single corpus ...
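
A rough sketch of the merging idea in Python, in case it helps.  This is not an existing CWB or CQP API: `run_query` is a hypothetical function that your front-end would have to provide (e.g. by talking to a CQP process), assumed to return (start, end, text) tuples for one corpus.  The corpus names in the usage note are made up as well.  The one point that matters is that corpus positions are only meaningful within a single corpus, so each hit has to carry the name of the corpus it came from.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    corpus: str  # ID of the CWB corpus the match was found in
    start: int   # corpus position of the first token (local to that corpus)
    end: int     # corpus position of the last token (local to that corpus)
    text: str    # matched text as returned by the front-end


# Hypothetical signature: (corpus ID, CQP query) -> (start, end, text) tuples.
QueryFn = Callable[[str, str], Iterable[Tuple[int, int, str]]]


def merge_results(corpora: List[str], query: str, run_query: QueryFn) -> List[Hit]:
    """Run the same query on every corpus and concatenate the hits.

    Hits keep the order of `corpora`, so listing the big corpus first and
    the small update corpus second puts older material before newer pages.
    """
    merged: List[Hit] = []
    for corpus in corpora:
        for start, end, text in run_query(corpus, query):
            merged.append(Hit(corpus, start, end, text))
    return merged
```

You would call it as `merge_results(["MAIN", "UPDATES"], query, run_query)`.  Anything that depends on the whole result set (frequency counts, sorting, random samples) has to be recomputed on the merged list, which is where the loss of efficiency comes from.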

> Decoding the entire corpus, adding the new data to the generated file and re-encoding the new file is an option, but the server we're running on isn't exactly fast. Any way to save a few CPU cycles and directly insert the new data into the existing corpus? Perhaps there's some functionality to combine two corpora into one?

In theory it would be possible to write a program that merges two CWB-indexed corpora (provided they have exactly the same annotation) more efficiently than encoding from the original source files.  I'm not sure how much of a speed-up this would give, and none of us seems to have enough spare time at the moment to tackle that project.
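
Until someone writes such a merge tool, re-encoding from source remains the only route.  Here is a minimal sketch of how a script might assemble the two commands for re-encoding the small update corpus.  The flags are the ones I'd expect from the `cwb-encode` and `cwb-make` man pages (`-d` data directory, `-f` input file, `-R` registry file, `-P`/`-S` attribute declarations, `-r` registry directory, `-V` validate); please check them against your installed version.  The paths and attribute names are assumptions for illustration, and the function only builds the command lines without running anything.

```python
from typing import List, Sequence


def reencode_commands(
    corpus_id: str,
    data_dir: str,
    registry_dir: str,
    input_file: str,
    p_attrs: Sequence[str] = ("pos", "lemma"),  # assumed annotation, adjust to your data
    s_attrs: Sequence[str] = ("s",),            # assumed structural attributes
) -> List[List[str]]:
    """Build (but do not run) the commands that re-encode one corpus."""
    # The registry file is named after the corpus ID in lowercase.
    registry_file = f"{registry_dir}/{corpus_id.lower()}"

    encode = ["cwb-encode", "-d", data_dir, "-f", input_file, "-R", registry_file]
    for attr in p_attrs:
        encode += ["-P", attr]  # positional attributes besides the default 'word'
    for attr in s_attrs:
        encode += ["-S", attr]  # structural attributes (XML regions)

    # cwb-make builds the indexes and compresses them; it takes the
    # corpus ID in uppercase.
    make = ["cwb-make", "-r", registry_dir, "-V", corpus_id.upper()]
    return [encode, make]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`.  Both corpora must be encoded with the same attribute declarations, otherwise the same query will not run on both of them.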

Best,
Stefan

