[CWB] BIGINT

Stefan Evert stefanML at collocations.de
Mon Jan 15 10:17:14 CET 2018


> On 15 Jan 2018, at 09:02, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
> This is one of the changes that the new Ziggurat design makes. We can't make such a change in 3.4/3.5 because it breaks compatibility with existing indexed corpora. But, since you're rolling your own, there's no reason for you not to change the file format, unless you need those corpora to work with mainline CWB as well.

To expand on this a little, here are some further technical reasons why we decided not to implement such a 64-bit version of CWB (even if we were willing to break backward compatibility – CWB 3.4 still works perfectly fine with corpora that were indexed back in the last millennium!).

 – Most users would have 32-bit indexed corpora and 64-bit indexed corpora side-by-side, so they'd have to take great care to keep them separate.  There are no version numbers in CWB index files, so apps wouldn't have any way of telling whether they're reading a 32-bit or 64-bit indexed corpus. In the best case, CQP would just crash; in the worst case, it would produce nonsensical results.

 – In addition to changing the file format, we would also have to work through the entire CWB source code and replace each int variable that holds a corpus position with int64_t. This has to be decided case by case, since you might not want to do the same for variables that hold lexicon IDs.  We would end up rewriting substantial parts of CWB anyway, so a complete reimplementation seems to make more sense.

 – Andrew and I can't afford to store a 10-billion-word corpus in uncompressed form (close to a terabyte if you go all 64-bit), so the compressed file format would also have to be redefined and the compression/decompression code rewritten.  The current implementation makes the implicit assumption that corpus size < 2^31 tokens and will crash otherwise.
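
To make the size limit and the storage arithmetic in the points above concrete, here is a rough Python sketch. It is not CWB code: the helper names fits_in_int32 and token_stream_bytes are made up for illustration, and the big-endian byte order passed to struct.pack is only an assumption for the example. It checks whether a corpus position fits into a signed 32-bit integer and computes the uncompressed size of a single array that stores one fixed-width value per token.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_int32(corpus_position: int) -> bool:
    """Return True if the position can be stored as a signed 32-bit int."""
    return -2**31 <= corpus_position <= INT32_MAX


def token_stream_bytes(n_tokens: int, bytes_per_value: int) -> int:
    """Uncompressed size of ONE array holding one fixed-width value per token."""
    return n_tokens * bytes_per_value


# Hypothetical corpus size taken from the discussion: 10 billion tokens.
n_tokens = 10_000_000_000
last_position = n_tokens - 1

print("last position fits in int32:", fits_in_int32(last_position))
print("one array, 32-bit values:", token_stream_bytes(n_tokens, 4), "bytes")
print("one array, 64-bit values:", token_stream_bytes(n_tokens, 8), "bytes")

# Packing the position as a signed 32-bit value fails outright,
# whereas a signed 64-bit value has room for it.
try:
    struct.pack(">i", last_position)  # ">i" = big-endian signed 32-bit (assumed layout)
except struct.error as exc:
    print("32-bit pack failed:", exc)

packed = struct.pack(">q", last_position)  # ">q" = big-endian signed 64-bit
print("64-bit pack succeeded, length:", len(packed), "bytes")
```

The figure computed here covers a single array only. A full index needs more than one such array for each attribute, which is how the uncompressed total grows well beyond it.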

Given the amount of work that would need to be done, it made more sense to go for Ziggurat as a long-term solution.

Best,
Stefan 

