[CWB] charset question
Richard Eckart
eckart at linglit.tu-darmstadt.de
Wed Aug 29 15:06:25 CEST 2007
Hi there,
just to add something I have just stumbled across:
Whenever you use a UTF-8 encoded corpus, NEVER try to set Context to
XXX characters. In CQP a character is a byte, whereas in UTF-8 a
character may span several bytes, so the query results WILL contain
broken UTF-8 because some characters will be sliced in half. Assuming
that the tokenizer you used on your corpus doesn't have the same
problem, you'll be fine setting Context to XXX words.
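A minimal Python sketch of the effect (this is not CQP code; the sample
text and the helper names byte_context and is_valid_utf8 are made up
for illustration): a context window counted in bytes can start or end
in the middle of a multi-byte character, and the resulting slice no
longer decodes as UTF-8.

```python
def byte_context(data: bytes, start: int, end: int, width: int) -> bytes:
    """Return the match data[start:end] plus `width` bytes on each side."""
    return data[max(0, start - width):min(len(data), end + width)]


def is_valid_utf8(chunk: bytes) -> bool:
    """True if `chunk` decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Äpfel über Öl"
data = text.encode("utf-8")       # 13 characters, but 16 bytes

# Locate the match "über" as byte offsets (end offset is exclusive).
needle = "über".encode("utf-8")
start = data.index(needle)
end = start + len(needle)

window = byte_context(data, start, end, 6)   # 6 *bytes* of context
print(is_valid_utf8(window))                 # False: window starts inside "Ä"
print(window.decode("utf-8", errors="replace"))
```

With a width counted in whole tokens instead of bytes, the window edges
always fall between characters, which is why a word-based context is
safe.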
It would be a nice feature to be able to tell CQP that it's working
on UTF-8, so that it expands the context window to always include
complete UTF-8 characters. Has that been implemented in some way?
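One way such a feature could work, sketched in Python under the
assumption that the underlying data is valid UTF-8 (the function names
are hypothetical and nothing here reflects how CQP is actually
implemented): UTF-8 continuation bytes all have the bit pattern
10xxxxxx, so a byte window can be widened until neither edge sits on
one.

```python
from typing import Tuple


def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def expand_to_char_boundaries(data: bytes, left: int, right: int) -> Tuple[int, int]:
    """Widen data[left:right] so that it does not cut a UTF-8 character."""
    # Move the left edge back to the lead byte of the character it falls in.
    while 0 < left < len(data) and is_continuation(data[left]):
        left -= 1
    # Move the right edge forward past any remaining continuation bytes.
    while right < len(data) and is_continuation(data[right]):
        right += 1
    return left, right


data = "Äpfel über Öl".encode("utf-8")

# The byte window 5..14 ends between the two bytes of "Ö".
left, right = expand_to_char_boundaries(data, 5, 14)
print(data[left:right].decode("utf-8"))   # l über Ö
```

The window grows by at most three bytes on each side, since a UTF-8
character is at most four bytes long.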
Best regards,
Richard Eckart
Darmstadt University of Technology
Institute of Linguistics and Literary Studies
Department of English Linguistics
Hochschulstrasse 1
64289 Darmstadt
Germany