[CWB] CQP encoding problem

Stefan Evert stefanML at collocations.de
Fri Dec 9 11:28:06 CET 2011


> I've been trying to use CQP 3.2.5 in Windows for querying a Russian
> corpus saved in utf-8, but it crashes every time I enter unicode
> symbols as a query. I could find some information on this in FAQ:
> http://cwb.sourceforge.net/faq.php?hoist=windows_terminal#windows_terminal
> 
> "Input of accented characters in UTF-8 doesn't seem to work
> (regardless of codepage setting), and such queries may crash CQP."
> Does it mean that it's impossible to use utf-8 symbols for querying the corpus?

Yes, I'm afraid that as things stand you can't enter UTF-8 strings in an interactive CQP session on Windows. We don't have many Windows users so far, and I believe most of them run CQP as a backend process in CQPweb or through the CQi client-server API.

IIRC, the key problem was that you can write "chcp 65001" in the Windows command shell to display UTF-8 _output_, but that this doesn't change the character encoding for _input_.

As a very clumsy workaround, you can enter the (hexadecimal) numeric codes of the Unicode characters using PCRE's \x{....} notation, but you'd have to write a wrapper that does this automatically for strings if you intend to do any serious work in CQP.
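A minimal sketch of such a wrapper in Python, assuming the query text reaches the script as a proper Unicode string (e.g. read from a UTF-8 file rather than typed at the Windows console) and that CQP hands the \x{....} escapes to PCRE unchanged. The function name escape_non_ascii is made up for illustration and is not part of CWB:

```python
def escape_non_ascii(query: str) -> str:
    """Replace every non-ASCII character with PCRE's \\x{HHHH} notation.

    ASCII characters (quotes, brackets, regex operators) are left as they are,
    so the rest of the CQP query syntax is not touched.
    """
    return "".join(
        ch if ord(ch) < 128 else "\\x{%04X}" % ord(ch)
        for ch in query
    )


if __name__ == "__main__":
    # Hypothetical query for the Russian word "дом"
    print(escape_non_ascii('"дом";'))
    # prints: "\x{0434}\x{043E}\x{043C}";
```

The escaped query consists of ASCII characters only, so it can be typed or piped into CQP regardless of the console's input encoding.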

A quick Web search suggests that there may be better long-term solutions, but we'd need an experienced Windows user/developer to implement and test these.  Anyone interested in giving this a shot?

 - Console2 in combination with a recent version of Windows PowerShell (from the Community Tech Preview) is said to work better with UTF-8: http://stackoverflow.com/questions/379240/is-there-a-windows-command-shell-that-will-display-unicode-characters

 - The Win32 API seems to allow programs to set input/output encodings of the shell, so perhaps CQP just needs to call SetConsoleCP(), SetConsoleOutputCP(), etc. with the right arguments: http://msdn.microsoft.com/en-us/library/ms686013(VS.85).aspx
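For anyone who wants to experiment with the second idea, here is a rough sketch of what those calls look like, written in Python with ctypes for brevity (inside CQP the same Win32 functions would be called from C). SetConsoleCP() and SetConsoleOutputCP() are the real Win32 functions; the helper name set_console_utf8 is made up, and this is untested against CQP. Whether it actually fixes UTF-8 input is exactly what would need checking.

```python
import ctypes
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def set_console_utf8() -> bool:
    """Ask the attached Windows console to use UTF-8 for input and output.

    Returns True only if both Win32 calls report success; returns False on
    non-Windows platforms or if either call fails (e.g. no console attached).
    """
    if sys.platform != "win32":
        return False
    kernel32 = ctypes.windll.kernel32
    ok_in = kernel32.SetConsoleCP(CP_UTF8)         # input code page
    ok_out = kernel32.SetConsoleOutputCP(CP_UTF8)  # output code page
    return bool(ok_in and ok_out)


if __name__ == "__main__":
    print("console switched to UTF-8:", set_console_utf8())
```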

Best wishes,
Stefan

