[CWB] Encoding error in Windows

Sun Apr 10 01:10:20 CEST 2011

As I noted in a previous mail, the ???? message indicates that the console is not passing well-formed UTF-8 characters to CQP. Changing the cmd.exe code page to UTF-8 before running CQP may help (chcp 65001). 
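A minimal sketch of how a Cyrillic query can reach CQP as literal question marks. It only models the effect in Python and is not the actual Windows console conversion; the helper name `console_bytes` is hypothetical, and cp437 is an assumed legacy code page (the real default depends on the system locale).

```python
def console_bytes(text: str, code_page: str) -> bytes:
    """Encode text for an assumed console code page.

    Characters the code page cannot represent are replaced with '?',
    which is what Python's errors="replace" does when encoding.
    """
    return text.encode(code_page, errors="replace")


query = "што"

# Assumed legacy code page without Cyrillic: every letter becomes '?'.
print(console_bytes(query, "cp437"))

# UTF-8 (code page 65001): the letters survive as multi-byte sequences.
print(console_bytes(query, "utf-8"))
```

Under these assumptions the first call yields three question marks, which matches the "???" in the CQP error, while the second yields well-formed UTF-8.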

The -c option for cwb-encode is documented in cwb-encode -h, but not yet in man cwb-encode. The corpus encoding tutorial document is still targeted at v3.0, which does not have Unicode (or Windows) support.

best

Andrew.

________________________________

From: George Goce Mitrevski [mailto:podmocani at yahoo.com] 
Sent: 09 April 2011 16:39
To: Hardie, Andrew; Open source development of the Corpus WorkBench
Subject: Re: [CWB] Encoding error in Windows

	Andrew,
	Thanks for the suggestion. It may be a good idea to include this info in the instruction page for cwb-encode. The corpus was encoded just fine. However, I'm still having a hell of a problem getting cqp to accept Cyrillic character encoding, even in utf8. Has anyone been successful in encoding and searching a Cyrillic corpus in Windows? I didn't encounter any such problems on Unix. Below are my encoding command and the search error:
	cwb-encode -d "C:\CWB\ANGELINA\data" -f "C:\CWB\ANGELINA\angelina.txt" -c utf8 -R "C:\CWB\registry\angelina" -xsB -S s:0 -S text:0+id+title+author+genre -S subject:0 -S publisher:0 -S dateOrigonal:0 -S dateDigital:0 -S identifier:0 -S citation:0 -S source:0 -S relation:0 -S hasPart:0 -S isPartOf:0

	C:\Windows\system32>cqp
	[no corpus]> ANGELINA;
	ANGELINA> "што";
	CL: Regex Compile Error: unrecognized character after (? or (?-
	CQP Error:
	        Illegal regular expression: ???

	Regards,
	George.

________________________________

	From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
	To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
	Cc: George Goce Mitrevski <podmocani at yahoo.com>
	Sent: Friday, April 8, 2011 4:15 PM
	Subject: Re: [CWB] Encoding error in Windows

	It means the encoding hasn't been set to utf8. This is possibly because you haven't specified the encoding using -c utf8 (cwb-encode defaults to Latin-1 if not told specifically what encoding to use).

	On the other hand, if you have specified that it is utf-8, then it may be a bug. If this is the case, could you specify precisely what command line you've been using? Thanks.
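A rough sketch of why UTF-8 Cyrillic text may be rejected when the declared charset is Latin-1. ISO-8859-1 assigns no printable characters to bytes 0x80-0x9F, and the UTF-8 encoding of Cyrillic letters routinely contains bytes in that range. That a strict Latin-1 check flags exactly these bytes is an assumption for illustration, not a description of cwb-encode's actual validation; the helper name `c1_bytes` is hypothetical.

```python
from typing import List


def c1_bytes(data: bytes) -> List[int]:
    """Return the bytes that fall in 0x80-0x9F, a range with no
    printable characters in ISO-8859-1 (Latin-1)."""
    return [b for b in data if 0x80 <= b <= 0x9F]


utf8 = "што".encode("utf-8")

# Show the raw UTF-8 bytes of the word.
print(" ".join(f"{b:02x}" for b in utf8))

# Bytes that an assumed strict Latin-1 check would reject.
print([hex(b) for b in c1_bytes(utf8)])
```

Plain ASCII input contains no such bytes, so a file can appear to encode correctly until the first non-ASCII character is reached.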

	best

	Andrew.

________________________________

		From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of George Goce Mitrevski
		Sent: 08 April 2011 22:09
		To: Open source development of the Corpus WorkBench
		Subject: [CWB] Encoding error in Windows

		Can someone please explain what's causing this encoding error when I try to encode a corpus in Windows in utf8?

		"Encoding error: an invalid byte or byte sequence for charset "latin1" was encountered."

		Thanks much.
