[CWB] charset question

Dieter Schicker dieter.schicker at uni-graz.at
Wed Aug 1 10:21:11 CEST 2007


Oh, thanks a lot. I'm a little surprised that CWB supports UTF-8!
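
For the ISO-8859-5 corpus as it stands, a more general alternative to a hard-coded table might be to transcode with Perl's core Encode module. What follows is only a minimal sketch, assuming the form delivers raw UTF-8 bytes and CQP expects and returns ISO-8859-5 bytes. The helper names query_to_corpus_bytes and result_to_utf8_bytes are made up for illustration and are not part of the CWB Perl modules:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the corpus is stored as ISO-8859-5; change this for other
# corpora (e.g. 'iso-8859-2').
my $CORPUS_CHARSET = 'iso-8859-5';

# Hypothetical helper: convert the UTF-8 bytes submitted by the form into
# bytes in the corpus charset. FB_CROAK makes Encode die on input that is
# malformed or cannot be represented; LEAVE_SRC keeps the input unmodified.
sub query_to_corpus_bytes {
    my ($utf8_bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $chars = decode('UTF-8', $utf8_bytes, $check);
    return encode($CORPUS_CHARSET, $chars, $check);
}

# Hypothetical helper: convert a byte string in the corpus charset (for
# example a line of query results) back into UTF-8 bytes for the web page.
sub result_to_utf8_bytes {
    my ($corpus_bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $chars = decode($CORPUS_CHARSET, $corpus_bytes, $check);
    return encode('UTF-8', $chars);
}

# The query from the original message, as UTF-8 bytes.
my $form_value = "\xD0\xBE\xD0\xBD\xD0\xB8";
my $cqp_bytes  = query_to_corpus_bytes($form_value);

# Show the bytes that would be handed to CQP.
print join(' ', map { sprintf '%02X', ord } split //, $cqp_bytes), "\n";
# prints "DE DD D8"
```

If the query is already a Perl character string (the "\x{043E}\x{043D}\x{0438}" notation), only the encode step is needed. Setting $CORPUS_CHARSET to 'iso-8859-2' should cover the other corpora in the same way.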

Regards,
Dieter

Serge Sharoff wrote:
> Hi,
>
> I think the problem here is that your interface sends utf8, which never
> gets converted to iso-8859-5 before it reaches the corpus.  If your
> corpus is encoded in utf8 and you do utf8::decode for every result
> returned by CQP, everything works fine, have a look:
> http://corpus.leeds.ac.uk/ruscorpora.html
>
> the interface script is available as open source from http://csar.sf.net
>
> Best,
> Serge
>
> On Wed, 2007-08-01 at 07:23 +0200, Dieter Schicker wrote:
>   
>> Hi again,
>>
>> I have another question concerning charset issues. We have a Russian
>> corpus encoded in iso-8859-5 and a small web interface with
>>
>> <form accept-charset="utf-8" enctype="application/x-www-form-urlencoded" ...
>>
>>
>> that sends queries to CQP. On the server side we use the Perl
>> modules provided by the CWB distribution. The main problem is that
>> whatever query I send to CQP, it doesn't find anything.
>>
>> Here's an example of a query: "они", which - in Perl syntax - looks like
>> "\x{043E}\x{043D}\x{0438}" => no results.
>>
>> So, my question is: How do I have to encode/transform the query string
>> so that CQP "understands" it? Maybe someone can point me in the right
>> direction.
>>
>> Btw: We also have several corpora encoded in iso-8859-2, where I managed
>> to get results by applying an (ugly) hard-coded conversion table which
>> maps "\x{xxxx}" notation to octal representation. Of course I could do
>> that for the iso-8859-5 corpus, too, but I'm looking for a more
>> "universal" solution.
>>
>> Many thanks in advance
>> Dieter
>>
>>     


