[CWB] change charset to latin1

Stefan Evert stefanML at collocations.de
Tue Mar 9 19:08:29 CET 2010


To complement Andrew's explanation:

> I want to change it by uncommenting the line:
> :: charset = "latin1"
> or
> charset = "latin1"

Neither of these is valid registry file syntax.

> I have a corpus and the diacritic flag (%d) doesn't work. I think
> that my charset is UTF8 because I see this commented-out line in the
> registry:
> ##:: charset  = "latin1" # character encoding of corpus data

Actually, this is not a comment -- although it looks like one -- but
rather a "corpus property", i.e. a key-value pair that specifies
corpus metadata.  The registry file parser recognises ##:: as a
special token that starts a corpus property definition.
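
As a rough illustration of the distinction, here is a minimal Python
sketch that tells a ##:: property line apart from an ordinary comment.
It is not the actual CWB registry parser: the function name
parse_property and the regular expression are assumptions, modelled
only on the single line quoted above.

```python
import re

# Hypothetical pattern, modelled on the line quoted above:
#   ##:: charset  = "latin1" # character encoding of corpus data
# It requires the ##:: token, a key, "=", and a double-quoted value.
PROPERTY_RE = re.compile(r'^##::\s*(\w+)\s*=\s*"([^"]*)"')


def parse_property(line):
    """Return (key, value) for a ##:: corpus property line, else None."""
    match = PROPERTY_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    samples = [
        '##:: charset  = "latin1" # character encoding of corpus data',
        '# an ordinary comment',
        ':: charset = "latin1"',
        'charset = "latin1"',
    ]
    for sample in samples:
        print(repr(sample), "->", parse_property(sample))
```

Only the first sample is treated as a property; the two "uncommented"
variants from the question are rejected, just like a plain comment.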

The reason for this peculiar format is backwards compatibility with
earlier CWB versions, which was a big issue when we were working on
CQP at the IMS (because most people would use a stable release and we
had to make sure everything still worked for them).

Best,
Stefan


