[CWB] CL: Error, unrecognised CorpusCharset in cl_string_validate_encoding

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Apr 6 11:09:10 CEST 2011


Hi George,

Unfortunately, cp1251 is not supported, and there are no plans to support it.  "cyrillic" is a synonym for "iso-8859-5". Unlike cp1252 and iso-8859-1, cp1251 and iso-8859-5 are *not* compatible with one another. So that is the cause of the second error: CWB simply doesn't recognise "cp1251". (In theory it should have failed well before that point; it shouldn't even have let you select a corpus with a bad charset property.)
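
To illustrate the incompatibility, here is a minimal Python sketch (Python is used only for illustration; it is not part of CWB). It encodes the query word from the original message in both charsets and shows that the byte sequences differ, whereas a Latin-1 word comes out identically in cp1252 and iso-8859-1. The variable names are my own.

```python
# Compare how the same Cyrillic word is stored in cp1251 and in ISO-8859-5.
word = "кога"

cp1251_bytes = word.encode("cp1251")
iso_bytes = word.encode("iso8859_5")

# Reading cp1251 bytes as if they were ISO-8859-5 yields different letters.
misread = cp1251_bytes.decode("iso8859_5")

# By contrast, ordinary accented Latin letters have the same byte values
# in cp1252 and ISO-8859-1.
latin_word = "café"
same_latin = latin_word.encode("cp1252") == latin_word.encode("latin-1")

print("cp1251:    ", cp1251_bytes.hex(" "))
print("iso-8859-5:", iso_bytes.hex(" "))
print("cp1251 bytes read as iso-8859-5:", misread)
print("cp1252 and latin-1 agree on 'café':", same_latin)
```

So simply relabelling a cp1251 corpus as "cyrillic" does not make the data valid iso-8859-5; the bytes themselves have to be converted.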

The cause of the first error, on the other hand, seems to be a charset incompatibility between what your console is sending and what CQP is expecting. It looks as if what you're typing is being translated by the cmd.exe console into Latin-1, where it comes out as a string of question marks (????), thus creating an invalid regular expression (a regex can't start with a question mark).
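
A rough Python sketch of that effect, under the assumption that the console replaces every character it cannot represent in Latin-1 with "?". Python's `re` module stands in here for the regex library that CQP actually uses, so the error message will not match the one CQP prints.

```python
import re

query = "кога"

# Cyrillic letters have no Latin-1 equivalent, so each one is replaced by "?".
mangled = query.encode("latin-1", errors="replace").decode("latin-1")
print(mangled)

# A pattern consisting only of quantifiers has nothing to apply them to,
# so compilation fails.
try:
    re.compile(mangled)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("regex error:", exc)
```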

The best thing to do in this case would be to recode the corpus and re-index it, either in 8859-5 (but be aware that full support for this charset is not yet implemented; e.g. CWB doesn't yet have any knowledge of case/accent folding for 8859-5) or, better yet, in UTF-8.
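
The recoding step could be sketched as follows in Python, assuming the corpus source text is a plain file in cp1251. The function names and the file names in the usage note are hypothetical, and the result still has to be re-indexed with the CWB tools afterwards.

```python
from pathlib import Path


def recode_bytes(data, src_encoding="cp1251", dst_encoding="utf-8"):
    """Decode bytes from the source charset and re-encode them in the target."""
    # Strict decoding raises UnicodeDecodeError on bytes that are invalid
    # in the source charset instead of silently mangling them.
    return data.decode(src_encoding).encode(dst_encoding)


def recode_file(src_path, dst_path, src_encoding="cp1251", dst_encoding="utf-8"):
    """Recode a whole input file; the original file is left untouched."""
    data = Path(src_path).read_bytes()
    Path(dst_path).write_bytes(recode_bytes(data, src_encoding, dst_encoding))
```

For example, `recode_file("corpus_cp1251.vrt", "corpus_utf8.vrt")` would write a UTF-8 copy of a (hypothetical) input file; passing `dst_encoding="iso8859_5"` would produce the 8859-5 variant instead.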

best

Andrew.



-----------------------
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of George Mitrevski
Sent: 05 April 2011 01:18
To: cwb at sslmit.unibo.it
Subject: [CWB] CL: Error,unrecognised CorpusCharset in cl_string_validate_encoding

Hi everyone.
I am trying to access a corpus in cyrillic (cp1251) in Windows with cqp.exe. I got the cqp.exe window to accept cyrillic characters, but now I encountered another problem.

In the registry I change the charset to "cyrillic" and I get this error:

MKCORPUS> "кога";
CL: Regex Compile Error: unrecognized character after (? or (?-
CQP Error:
        Illegal regular expression: ????


When I change the charset to "cp1251", I get this error:

MKCORPUS> "кога";
CL: Error, unrecognised CorpusCharset in cl_string_validate_encoding.
CQP Error:
        Query includes a character or character sequence that is invalid
in the encoding specified for this corpus


Someone else reported a similar problem with the charset here: http://liste.sslmit.unibo.it/pipermail/cwb/2007-July/000077.html and the advice given was:

All you have to do is keep the "##::" and change the charset value to  
"latin2" (CQP won't understand iso-8859-2), like so:

What should I set the charset value to so that cqp can understand cyrillic texts?

Thanks much.
-- 
Dr. George Mitrevski
Professor Emeritus
Auburn University
Website: http://www.auburn.edu/~mitrege
Macedonian Higher Education Blog: http://visokoobrazovanie.blogspot.com/


