[CWB] Encoding error in Windows

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Apr 10 03:35:11 CEST 2011


Does the same thing happen if you try
 
C:\Windows\system32>cqp
[no corpus]> set DataDirectory ".";
[no corpus]> ANGELINA;
ANGELINA> Query = "што";
ANGELINA> cat Query > "Query.txt";
 
?
 
Andrew.
 


________________________________

	From: George Goce Mitrevski [mailto:podmocani at yahoo.com] 
	Sent: 10 April 2011 02:28
	To: Hardie, Andrew; Open source development of the Corpus WorkBench
	Subject: Re: [CWB] Encoding error in Windows
	
	
	
	Andrew,
	Changing the code page to UTF-8 doesn't seem to work either: cqp simply quits. Trying to enter a UTF-8 string terminates cqp.
	
	
	C:\Windows\system32>chcp 65001
	Active code page: 65001

	C:\Windows\system32>cqp
	[no corpus]> ANGELINA;
	ANGELINA> "што";

	C:\Windows\system32>

	I searched the internet for chcp 65001, and it seems others have had problems with the console in that code page. Here is what one person had to say: "CHCP 65001 indeed works quite well, but whenever I tried to run a batch file (pure ASCII!) from such a console, it never worked. There no output, no error error message, and the commands are not executed." Do we know whether the Windows console in fact supports 65001?
	
	
	My original corpus was in CP1251, and others are urging me to re-encode it in UTF-8, but now I'm wondering whether it's worth the effort if I can't work with it in Windows. When I change the Windows console code page to 1251 (chcp 1251), I have no problems searching the corpus.
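
For illustration, the re-encoding step mentioned above (CP1251 to UTF-8) can be sketched in a few lines of Python. This is a minimal sketch, not part of CWB: the function name reencode_file and the sample file names are hypothetical, and it assumes the whole input really is CP1251.

```python
import os
import tempfile


def reencode_file(src_path, dst_path, src_encoding="cp1251", dst_encoding="utf-8"):
    """Copy src_path to dst_path, converting the text encoding on the way.

    Hypothetical helper for illustration only. Decoding is strict, so a byte
    that is not defined in the source encoding raises UnicodeDecodeError
    instead of being silently replaced.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Round-trip demo on a temporary file (file names are made up).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample_cp1251.txt")
dst = os.path.join(tmp_dir, "sample_utf8.txt")
with open(src, "wb") as f:
    f.write("што\n".encode("cp1251"))

reencode_file(src, dst)

with open(dst, "rb") as f:
    converted = f.read()
print(converted == "што\n".encode("utf-8"))
```

The converted file would then be passed to cwb-encode with -c utf8, as in the command quoted further down the thread. Whether the result can be queried comfortably from the Windows console is a separate question from whether the conversion itself succeeds.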
	
	
	Regards,
	George.
	
________________________________

	From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
	To: George Goce Mitrevski <podmocani at yahoo.com>; Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
	Sent: Saturday, April 9, 2011 6:10 PM
	Subject: Re: [CWB] Encoding error in Windows
	
	
	As I noted in a previous mail, the ???? message indicates that the console is not passing well-formed UTF-8 characters to CQP. Changing the cmd.exe code page to UTF-8 before running CQP may help (chcp 65001). 
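
To make "not well-formed UTF-8" concrete, here is a small Python sketch. It assumes, purely for illustration, that the console hands the program the query string as bytes in its active code page; cp1251 and cp866 are used only as examples of Cyrillic code pages, and the helper name is_well_formed_utf8 is hypothetical. It does not reproduce what cmd.exe or CQP actually do internally.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


query = "што"

# Encode the same query in several code pages and check each byte
# sequence against the UTF-8 rules.
for codec in ("utf-8", "cp1251", "cp866"):
    raw = query.encode(codec)
    print(codec, raw.hex(), is_well_formed_utf8(raw))
```

Only the UTF-8 byte sequence passes the check; the single-byte Cyrillic encodings produce bytes that a UTF-8 decoder rejects, which is the kind of input a program expecting UTF-8 cannot interpret.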
	 
	The -c option for cwb-encode is documented in cwb-encode -h, but not yet in man cwb-encode. The corpus encoding tutorial document is still targeted at v3.0, which does not have Unicode (or Windows) support. 
	 
	best
	 
	Andrew.
	 
	 
________________________________

	From: George Goce Mitrevski [mailto:podmocani at yahoo.com] 
	Sent: 09 April 2011 16:39
	To: Hardie, Andrew; Open source development of the Corpus WorkBench
	Subject: Re: [CWB] Encoding error in Windows
	
	

		Andrew,
		Thanks for the suggestion. It may be a good idea to include this info in the instruction page for cwb-encode. The corpus was encoded just fine. However, I'm still having a hell of a problem getting cqp to accept Cyrillic characters, even in UTF-8. Has anyone been successful in encoding and searching a Cyrillic corpus in Windows? I didn't encounter any such problems on Unix. Below are my encoding command and the search error:
		cwb-encode -d "C:\CWB\ANGELINA\data" -f "C:\CWB\ANGELINA\angelina.txt" -c utf8 -R "C:\CWB\registry\angelina" -xsB -S s:0 -S text:0+id+title+author+genre -S subject:0 -S publisher:0 -S dateOrigonal:0 -S dateDigital:0 -S identifier:0 -S citation:0 -S source:0 -S relation:0 -S hasPart:0 -S isPartOf:0

		C:\Windows\system32>cqp
		[no corpus]> ANGELINA;
		ANGELINA> "што";
		CL: Regex Compile Error: unrecognized character after (? or (?-
		CQP Error:
		        Illegal regular expression: ???

		Regards,
		George.

		
________________________________

		From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
		To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
		Cc: George Goce Mitrevski <podmocani at yahoo.com>
		Sent: Friday, April 8, 2011 4:15 PM
		Subject: Re: [CWB] Encoding error in Windows
		
		
		It means the encoding hasn't been set to utf8. This is possibly because you haven't specified the encoding using -c utf8 (cwb-encode defaults to Latin-1 if not told specifically what encoding to use).
		 
		On the other hand, if you have specified that it is utf-8, then it may be a bug. If this is the case, could you specify precisely what command line you've been using? Thanks.
		 
		best
		 
		Andrew.


________________________________

			From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of George Goce Mitrevski
			Sent: 08 April 2011 22:09
			To: Open source development of the Corpus WorkBench
			Subject: [CWB] Encoding error in Windows
			
			
			Can someone please explain what's causing this encoding error when I try to encode a corpus in UTF-8 on Windows?

			"Encoding error: an invalid byte or byte sequence for charset "latin1" was encountered."
			
			
			
			Thanks much.




