[CWB] charset question

Wed Aug 1 08:38:29 CEST 2007

Hi,

I think the problem here is a mix of encodings: the query arrives from
your web form in utf8 and is never converted to iso-8859-5, the encoding
of your corpus.  If your corpus is encoded in utf8 and you call
utf8::decode on every result returned by CQP, everything works fine.
Have a look:
http://corpus.leeds.ac.uk/ruscorpora.html

The interface script is available as open source from http://csar.sf.net
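
If you would rather keep the corpus in iso-8859-5, the conversion could be
sketched roughly as below. This is only a minimal sketch, assuming the form
parameter arrives as raw utf8 bytes and that the core Encode module is
available. The helper names to_corpus_bytes and from_corpus_bytes are made up
for illustration and are not part of the CWB Perl modules; the actual call
that sends the query to CQP is left out.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: the corpus is stored as ISO-8859-5, as in the quoted message.
my $corpus_charset = 'iso-8859-5';

# Hypothetical helper: Perl character string -> bytes in the corpus encoding.
sub to_corpus_bytes {
    my ($text) = @_;
    return encode($corpus_charset, $text);
}

# Hypothetical helper: bytes returned by CQP -> Perl character string.
sub from_corpus_bytes {
    my ($bytes) = @_;
    return decode($corpus_charset, $bytes);
}

# Assumption: this is what the form submits for the example query,
# i.e. the three Cyrillic letters as raw utf8 bytes.
my $raw_param = "\xD0\xBE\xD0\xBD\xD0\xB8";

# Step 1: utf8 bytes -> characters ("\x{043E}\x{043D}\x{0438}").
my $query_chars = decode('UTF-8', $raw_param);

# Step 2: characters -> iso-8859-5 bytes, which is what the query
# string sent to CQP should contain.
my $query_bytes = to_corpus_bytes($query_chars);

# Print the bytes as hex to check the conversion.
print join(' ', map { sprintf '%02X', ord($_) } split //, $query_bytes), "\n";
# prints: DE DD D8
```

The same pair of helpers should work for the iso-8859-2 corpora by changing
$corpus_charset, which would replace a hand-written conversion table. Results
coming back from CQP would go through from_corpus_bytes and then be encoded
as utf8 again before they are written to the web page.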

Best,
Serge

On Wed, 2007-08-01 at 07:23 +0200, Dieter Schicker wrote:
> Hi again,
> 
> I have another question concerning charset issues. We have a Russian
> corpus encoded in iso-8859-5 and a small web interface with
> 
> <form accept-charset="utf-8" enctype="application/x-www-form-urlencoded" ...
> 
> 
> that sends queries to CQP. On the server side we use the Perl
> modules provided by the CWB distribution. The main problem is that
> whatever query I send to CQP, it doesn't find anything.
> 
> Here's an example of a query: "они", which - in Perl syntax - looks like
> "\x{043E}\x{043D}\x{0438}" => no results.
> 
> So, my question is: How do I have to encode/transform the query string
> so that CQP "understands" it? Maybe someone can point me in the right
> direction.
> 
> Btw: We also have several corpora encoded in iso-8859-2, where I managed
> to get results by applying an (ugly) hard-coded conversion table which
> maps "\x{xxxx}" notation to octal representation. Of course I could do
> that for the iso-8859-5 corpus, too, but I'm looking for a more
> "universal" solution.
> 
> Many thanks in advance
> Dieter
>