[CWB] BIGINT

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Jan 15 09:02:47 CET 2018


This is one of the changes that the new Ziggurat design makes. We can't make such a change in 3.4/3.5 because it breaks compatibility with existing indexed corpora. But, since you're rolling your own, there's no reason for you not to change the file format, unless you need those corpora to work with mainline CWB as well.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Maarten Janssen
Sent: 15 January 2018 01:15
To: cwb at sslmit.unibo.it
Subject: [CWB] BIGINT

I have written a custom version of CQP, which I will happily share once I get the kinks out. It is only a partial implementation of the query language, but it implements some small things I needed: sorting on s-attributes, persistent naming of tokens (i.e. target:[] becomes a synonym for @[]), piped conversion from positions to strings, mapping from corpus positions to the corpus positions of the head, MI scores, and XML output.

Having therefore looked much further under the hood of CQP, it strongly looks to me like the only reason there is a 2G word limit is that it uses INT in its files. That limit could be raised to 2^63 tokens (which should be sufficient for the foreseeable future) by using BIGINT instead. At least in my code (where that particular bit was just copied from CQP, so I assume the same holds for CWB), the only thing that would involve is changing all occurrences of htonl/ntohl to htonll/ntohll (with the corresponding function supplied, since it is unfortunately not standard), and all occurrences of fread/fwrite(&i, 4, 1, stream) to fread/fwrite(&i, 8, 1, stream). Most of those are centralised, so there are relatively few occurrences in the code. Would that not be something worth changing, or am I missing something?
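To make the proposed change concrete, here is a minimal sketch of 64-bit big-endian file I/O in C. It is not taken from the CWB or CQP sources: the names encode_be64, decode_be64, write_int64 and read_int64 are hypothetical, and the sketch assumes the on-disk byte order stays network order, as implied by the existing htonl/ntohl calls. It shifts bytes explicitly rather than relying on htonll/ntohll, since those are not available on every platform.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: store a 64-bit value as 8 bytes in network
 * (big-endian) order, independent of the host's byte order. */
void encode_be64(uint64_t v, unsigned char out[8]) {
    for (int i = 7; i >= 0; i--) {
        out[i] = (unsigned char)(v & 0xff);  /* lowest byte goes last */
        v >>= 8;
    }
}

/* Hypothetical helper: rebuild a 64-bit value from 8 big-endian bytes. */
uint64_t decode_be64(const unsigned char in[8]) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | in[i];
    return v;
}

/* 8-byte counterpart of htonl + fwrite(&i, 4, 1, stream).
 * Returns 1 on success, 0 on failure. */
int write_int64(FILE *stream, int64_t value) {
    unsigned char buf[8];
    encode_be64((uint64_t)value, buf);
    return fwrite(buf, 8, 1, stream) == 1;
}

/* 8-byte counterpart of fread(&i, 4, 1, stream) + ntohl.
 * Returns 1 on success, 0 on a short read or error. */
int read_int64(FILE *stream, int64_t *value) {
    unsigned char buf[8];
    if (fread(buf, 8, 1, stream) != 1)
        return 0;
    *value = (int64_t)decode_be64(buf);
    return 1;
}
```

Any in-memory variable that holds a corpus position or token count would also need to be a 64-bit type, otherwise values read this way would be truncated on assignment.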

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

