[CWB] BIGINT

Maarten Janssen maartenpt at gmail.com
Mon Jan 15 02:15:24 CET 2018


After writing a custom version of CQP, I have been looking much further under the hood of CQP. (I will happily share my version once I get the kinks out. It is only a partial implementation of the query language, but it implements some small things I needed: sorting on s-attributes, persistent naming of tokens (i.e. target:[] becomes a synonym for @[]), piped conversion from positions to strings, mapping from corpus positions to corpus positions of the head, MI scores, and XML output.)

It strongly looks like the only reason there is a 2G word limit is that it uses INT in its files. That limit could be raised to 2^63 tokens (which should be sufficient for the foreseeable future) by using BIGINT instead. At least in my code (where that particular bit was just copied from CQP, so I assume the same holds in CWB), the only thing that would involve is changing all occurrences of htonl/ntohl to htonll/ntohll (with the corresponding function supplied, since it is unfortunately not standard), and all occurrences of fread/fwrite(&i, 4, 1, stream) to fread/fwrite(&i, 8, 1, stream). Most of those are centralised, so there are relatively few occurrences in the code.

Would that not be something worth changing, or am I missing something?
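To make the proposed change concrete, here is a minimal sketch of what the 64-bit replacements might look like. The names cwb_htonll, cwb_ntohll, write_pos64 and read_pos64 are hypothetical and are not taken from the CWB or CQP sources; they are prefixed to avoid clashing with the htonll/ntohll that some platforms already provide. The sketch assumes the files keep network (big-endian) byte order, as the existing htonl/ntohl calls imply, and that the variable being read or written is widened to a 64-bit type as well.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of CWB: convert a 64-bit value from host
 * byte order to network (big-endian) byte order. Works on any host
 * endianness because the bytes are laid out explicitly. */
static uint64_t cwb_htonll(uint64_t host)
{
    unsigned char bytes[8];
    uint64_t net;
    int i;

    for (i = 0; i < 8; i++)
        bytes[i] = (unsigned char)(host >> (56 - 8 * i)); /* most significant byte first */
    memcpy(&net, bytes, sizeof net);
    return net;
}

/* Hypothetical helper, not part of CWB: the inverse of cwb_htonll. */
static uint64_t cwb_ntohll(uint64_t net)
{
    unsigned char bytes[8];
    uint64_t host = 0;
    int i;

    memcpy(bytes, &net, sizeof bytes);
    for (i = 0; i < 8; i++)
        host = (host << 8) | bytes[i];
    return host;
}

/* Write one corpus position as 8 bytes in network order.
 * Returns 1 on success, 0 on failure. */
static int write_pos64(int64_t pos, FILE *stream)
{
    uint64_t net;

    assert(stream != NULL);
    net = cwb_htonll((uint64_t)pos);
    return fwrite(&net, 8, 1, stream) == 1;
}

/* Read one corpus position stored as 8 bytes in network order.
 * Returns 1 on success, 0 on failure (e.g. end of file). */
static int read_pos64(int64_t *pos, FILE *stream)
{
    uint64_t net;

    assert(stream != NULL && pos != NULL);
    if (fread(&net, 8, 1, stream) != 1)
        return 0;
    *pos = (int64_t)cwb_ntohll(net);
    return 1;
}
```

Note that files written this way would not be readable by code that still expects 4-byte integers, so existing indexed corpora would presumably need to be re-encoded or converted.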


