[CWB] charset question

Richard Eckart eckart at linglit.tu-darmstadt.de
Wed Aug 29 15:06:25 CEST 2007


Hi there,

just to add something I stumbled across:

Whenever you use a UTF-8 encoded corpus, NEVER set Context to XXX
characters. CQP treats one character as one byte, which does not hold
in UTF-8, so the query results WILL contain broken UTF-8 because some
multi-byte characters will be cut in half. Assuming that the tokenizer
you used on your corpus doesn't have the same problem, you'll be fine
setting Context to XXX words.
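
To illustrate the effect outside of CQP, here is a minimal Python
sketch. It is not CQP code; the sample word and the cut position are
assumptions chosen only to show what a byte-based cut does to UTF-8.

```python
# Sample word (an assumption for illustration): 'ö' and 'ß' each
# occupy two bytes in UTF-8, so 5 characters become 7 bytes.
text = "Größe"
data = text.encode("utf-8")

# A byte-based context of 3 "characters" keeps 'G', 'r' and only the
# first byte of 'ö'.
broken = data[:3]

try:
    broken.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    # The truncated two-byte sequence is not valid UTF-8.
    decodes_cleanly = False

print(len(text), len(data), decodes_cleanly)
```

A word-based context avoids this because the cut always falls between
tokens, never inside a multi-byte character.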

It would be a nice feature to be able to tell CQP that it is working
on UTF-8 and should therefore expand the context window so that it
always includes complete UTF-8 characters. Has that been implemented
in some way?
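
As a rough idea of what such an expansion could look like, here is a
Python sketch. It does not reflect anything in CQP itself; the function
names is_continuation and expand_to_char_boundaries are hypothetical,
and the sketch assumes the input is valid UTF-8. It relies on the fact
that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte is a UTF-8 continuation byte (10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def expand_to_char_boundaries(data: bytes, start: int, end: int) -> bytes:
    """Widen the byte window [start, end) so it holds whole characters.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    # Clamp the window to the available data.
    start = max(0, min(start, len(data)))
    end = max(start, min(end, len(data)))

    # Move the left edge back to the lead byte of its character.
    while 0 < start < len(data) and is_continuation(data[start]):
        start -= 1

    # Move the right edge forward past any pending continuation bytes.
    while end < len(data) and is_continuation(data[end]):
        end += 1

    return data[start:end]


data = "Größe".encode("utf-8")
left = expand_to_char_boundaries(data, 0, 3)    # cut fell inside 'ö'
middle = expand_to_char_boundaries(data, 3, 5)  # both edges inside characters
print(left.decode("utf-8"), middle.decode("utf-8"))
```

The window only ever grows, by at most three bytes on each side, so
the requested context is always fully contained in the result.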

Best regards,

Richard Eckart

Darmstadt University of Technology
Institute of Linguistics and Literary Studies
Department of English Linguistics

Hochschulstrasse 1
64289 Darmstadt
Germany




