[CWB] charset question

Dieter Schicker dieter.schicker at uni-graz.at
Wed Aug 1 07:23:29 CEST 2007


Hi again,

I have another question concerning charset issues. We have a Russian
corpus encoded in iso-8859-5 and a small web interface with

<form accept-charset="utf-8" enctype="application/x-www-form-urlencoded" ...


that sends queries to CQP. On the server side we use the Perl
modules provided by the CWB distribution. The main problem is that
whatever query I send to CQP, it doesn't find anything.

Here's an example of a query: "они", which, in Perl syntax, looks like
"\x{043E}\x{043D}\x{0438}" => no results.

So, my question is: How do I have to encode/transform the query string
so that CQP "understands" it? Maybe someone can point me in the right
direction.

Btw: We also have several corpora encoded in iso-8859-2, where I managed
to get results by applying an (ugly) hard-coded conversion table which
maps "\x{xxxx}" notation to octal representation. Of course I could do
that for the iso-8859-5 corpus, too, but I'm looking for a more
"universal" solution.
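
To make the idea concrete, here is a rough sketch of what a table-free
conversion might look like, using Perl's core Encode module. It assumes
the query has already been decoded into a Perl character string (as in
the "\x{....}" notation above) and that CQP expects raw bytes in the
corpus charset. The helper name query_to_corpus_bytes is made up for
this illustration and is not part of the CWB modules.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of the CWB distribution): turn a decoded
# Perl character string into a byte string in the corpus charset.
sub query_to_corpus_bytes {
    my ($query, $charset) = @_;
    # FB_CROAK makes encode() die on characters that the target charset
    # cannot represent, instead of silently substituting them.
    return encode($charset, $query, Encode::FB_CROAK);
}

my $query = "\x{043E}\x{043D}\x{0438}";    # the example query from above
my $bytes = query_to_corpus_bytes($query, "iso-8859-5");

# Show the resulting bytes in hex
print join(" ", map { sprintf "%02X", ord } split //, $bytes), "\n";
# prints "DE DD D8"
```

The same call with "iso-8859-2" as the charset name would presumably
cover the other corpora as well, so no per-charset table is needed.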

Many thanks in advance
Dieter

-- 
Dieter Schicker
Department of Information Processing in the Humanities
Karl Franzens University of Graz
Merangasse 70
A-8010 Graz
++43(0)316-380-8012


