[CWB] charset question
Dieter Schicker
dieter.schicker at uni-graz.at
Wed Aug 1 07:23:29 CEST 2007
Hi again,
I have another question concerning charset issues. We have a Russian
corpus encoded in ISO-8859-5 and a small web interface with
<form accept-charset="utf-8" enctype="application/x-www-form-urlencoded" ...
that sends queries to CQP. On the server side we use the Perl
modules provided by the CWB distribution. The main problem is that
whatever query I send to CQP, it doesn't find anything.
Here's an example of a query: "они", which, in Perl syntax, looks like
"\x{043E}\x{043D}\x{0438}" => no results.
So, my question is: How do I have to encode/transform the query string
so that CQP "understands" it? Maybe someone can point me in the right
direction.
Btw: We also have several corpora encoded in ISO-8859-2, where I managed
to get results by applying an (ugly) hard-coded conversion table which
maps "\x{xxxx}" notation to octal representation. Of course I could do
that for the ISO-8859-5 corpus, too, but I'm looking for a more
"universal" solution.
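To illustrate what I mean by "universal": a minimal sketch of such a
conversion, assuming the query is already a Perl character string (as in
the example above) and using the core Encode module instead of a
hand-written table. The helper name to_corpus_bytes is made up for this
example, and handing the result to CQP is left out. Whether this is the
right approach for CQP is exactly what I'm unsure about.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of the CWB modules): turn a Perl character
# string into bytes in the corpus charset. Dies if a character cannot be
# represented in that charset instead of silently substituting one.
sub to_corpus_bytes {
    my ($chars, $charset) = @_;
    return Encode::encode($charset, $chars,
                          Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# The example query from above, as a character string.
my $query = "\x{043E}\x{043D}\x{0438}";

# Assumption: the corpus charset is known per corpus, here ISO-8859-5.
my $bytes = to_corpus_bytes($query, 'iso-8859-5');

# Show the resulting bytes in hex for inspection.
print unpack('H*', $bytes), "\n";    # prints "deddd8"
```

The same call with 'iso-8859-2' as the charset would cover the other
corpora, so only the charset name would differ per corpus.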
Many thanks in advance
Dieter
--
Dieter Schicker
Department of Information Processing in the Humanities
Karl Franzens University of Graz
Merangasse 70
A-8010 Graz
++43(0)316-380-8012