[CWB] Problem with cwb-make

Stefan Evert stefanML at collocations.de
Thu May 10 14:15:34 CEST 2012


> Btw, id is specific for each token as the source uses word encoding. So I assume it should stay a p-attribute.

Yes, but don't do that if the corpus is to be of any substantial size.  CWB assumes that all annotations are linguistic features with a limited number of types (i.e. distinct values).

If you build a 100 M word corpus with unique IDs in a p-attribute, CWB has to store a lexicon of 100 M plain-text IDs, Huffman compression will increase rather than decrease the on-disk size of the attribute, and the lookup index is completely useless (because it lists exactly one token for each type, with enormous overhead).
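
To make the type/token argument concrete, here is a minimal Python sketch.  It is not part of CWB; the toy columns and the function name type_token_stats are made up for illustration.  It only counts distinct values in a column, which is roughly what determines the lexicon size of a p-attribute.

```python
from collections import Counter


def type_token_stats(values):
    """Return (tokens, types, average tokens per type) for one column."""
    counts = Counter(values)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    avg = n_tokens / n_types if n_types else 0.0
    return n_tokens, n_types, avg


# Toy data (hypothetical): a part-of-speech column vs. a unique-ID column.
pos_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN", "IN", "DT", "NN", "."]
unique_ids = [f"tok{i:09d}" for i in range(len(pos_tags))]

if __name__ == "__main__":
    print("pos:", type_token_stats(pos_tags))
    print("id: ", type_token_stats(unique_ids))
```

For the ID column the number of types always equals the number of tokens, so the lexicon grows linearly with the corpus and every index entry points to a single corpus position.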

The unique ID of a token is its corpus position.  If you need your own unique ID, try to break it up into two or more non-unique parts and store them in separate ID attributes.  E.g. token ID = filename + sentence number (within file) + token number (within sentence).
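
As a rough sketch of the splitting idea, assuming your IDs happen to look like "filename:sentence:token" (that layout, the regular expression and both function names are assumptions for illustration, not anything CWB prescribes), a small Python helper could break each ID into its parts and write them as extra tab-separated columns of the one-token-per-line input:

```python
import re

# Hypothetical ID layout: "<filename>:<sentence number>:<token number>".
# Adapt the pattern to whatever your source data actually uses.
ID_PATTERN = re.compile(r"^(?P<file>.+):(?P<sent>\d+):(?P<tok>\d+)$")


def split_token_id(token_id):
    """Split a unique token ID into three non-unique parts (as strings)."""
    match = ID_PATTERN.match(token_id)
    if match is None:
        raise ValueError(f"unexpected token ID format: {token_id!r}")
    return match.group("file"), match.group("sent"), match.group("tok")


def to_vertical_line(word, token_id):
    """Build one tab-separated input line: word plus the three ID parts."""
    file_part, sent_part, tok_part = split_token_id(token_id)
    return "\t".join([word, file_part, sent_part, tok_part])


if __name__ == "__main__":
    print(to_vertical_line("house", "doc17.xml:42:7"))
```

Each of the three columns then has far fewer distinct values than there are tokens, and the original ID can be rebuilt by joining the parts again.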

Cheers,
Stefan


