[CWB] CL: Error, unrecognised CorpusCharset in cl_string_validate_encoding

luigi.talamo at libero.it luigi.talamo at libero.it
Wed Apr 6 13:42:57 CEST 2011


Hardie, Andrew wrote:

> The best thing to do in this case would be to recode the corpus and
> re-index, either in 8859-5 (BUT: be aware full support for this charset is
> not yet implemented, e.g. CWB doesn't yet have any knowledge of case/accent
> folding for 8859-5) or - better yet - in UTF-8.

I guess cp1251 maps well onto Unicode (and hence UTF-8):

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT

which should make re-encoding less painful.
You just have to open the cp1251-encoded file in a text editor that supports
UTF-8 and save it with the new encoding.
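
If the corpus is too large for an editor, the same conversion can be scripted.
Here is a minimal Python sketch, assuming the input really is cp1251; the
function name and the file paths are just placeholders of mine, not anything
from CWB. Decoding is strict, so it stops with a UnicodeDecodeError if it
meets a byte that cp1251 does not define, rather than silently writing
garbage.

```python
def recode_cp1251_to_utf8(src_path, dst_path):
    """Copy a text file, decoding it as cp1251 and re-encoding it as UTF-8.

    Both paths are hypothetical; point them at your own corpus files.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="cp1251", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Stream line by line so large corpus files do not need to fit in memory.
        for line in src:
            dst.write(line)
```

You would call it as recode_cp1251_to_utf8("corpus.cp1251.txt",
"corpus.utf8.txt") and then re-index the corpus from the new file, as Andrew
suggested.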
Let us know if you run into any trouble.

Best regards,

Luigi
 



