[CWB] charset question

Serge Sharoff s.sharoff at leeds.ac.uk
Wed Aug 1 10:26:02 CEST 2007


No, it doesn't.  You can't use UTF-8 characters in regular expressions,
for instance, or perform case-insensitive search.  But CWB doesn't harm
encodings either, so it can work with Chinese and Japanese in exactly
the same way.
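
For the question quoted below, here is a minimal sketch of the conversion
step, written in Python rather than the Perl used by the interface. The idea
is to decode the form value from UTF-8, re-encode it into the corpus charset
before it goes to CQP, and decode the results on the way back. The function
names and the CORPUS_CHARSET constant are made up for this illustration and
are not part of CWB or its Perl modules.

```python
# Assumption: the charset the corpus was encoded with (iso-8859-5 in the
# question below; iso-8859-2 would work the same way for the other corpora).
CORPUS_CHARSET = "iso-8859-5"


def form_value_to_text(raw: bytes) -> str:
    """Decode the bytes submitted by a form with accept-charset="utf-8"."""
    return raw.decode("utf-8")


def text_to_corpus_bytes(text: str, charset: str = CORPUS_CHARSET) -> bytes:
    """Encode a query into the corpus charset before sending it to CQP.

    Raises UnicodeEncodeError if a character has no equivalent in the
    target charset.
    """
    return text.encode(charset)


def corpus_bytes_to_text(raw: bytes, charset: str = CORPUS_CHARSET) -> str:
    """Decode a result line that came back in the corpus charset."""
    return raw.decode(charset)
```

This would replace a hand-written conversion table with the codec tables that
ship with the language. In Perl, the core Encode module provides encode() and
decode(), which could play the same role.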
Best,
Serge
On Wed, 2007-08-01 at 10:21 +0200, Dieter Schicker wrote:
> Oh, thanks a lot. I'm a little bit surprised that CWB supports utf-8!
> 
> Regards,
> Dieter
> 
> Serge Sharoff wrote:
> > Hi,
> >
> > I think the problem here is an encoding mismatch: the query arrives
> > from the interface in utf8 and is not converted into iso-8859-5.  If your
> > corpus is encoded in utf8 and you do utf8::decode for every result
> > returned by CQP, everything works fine; have a look:
> > http://corpus.leeds.ac.uk/ruscorpora.html
> >
> > The interface script is available as open source from http://csar.sf.net
> >
> > Best,
> > Serge
> >
> > On Wed, 2007-08-01 at 07:23 +0200, Dieter Schicker wrote:
> >   
> >> Hi again,
> >>
> >> I have another question concerning charset issues. We have a Russian
> >> corpus encoded in iso-8859-5 and a small web interface with
> >>
> >> <form accept-charset="utf-8" enctype="application/x-www-form-urlencoded" ...
> >>
> >>
> >> that sends queries to CQP. On the server side we use the Perl
> >> modules provided by the CWB distribution. The main problem is that
> >> whatever query I send to CQP, it doesn't find anything.
> >>
> >> Here's an example of a query: "они", which - in Perl syntax - looks like
> >> "\x{043E}\x{043D}\x{0438}" => no results.
> >>
> >> So, my question is: How do I have to encode/transform the query string
> >> so that cqp "understands" it? Maybe someone can point me in the right
> >> direction.
> >>
> >> Btw: We also have several corpora encoded in iso-8859-2, where I managed
> >> to get results by applying an (ugly) hard-coded conversion table which
> >> maps "\x{xxxx}" notation to octal representation. Of course I could do
> >> that for the iso-8859-5 corpus, too, but I'm looking for a more
> >> "universal" solution.
> >>
> >> Many thanks in advance
> >> Dieter
> >>
> >>     
> > _______________________________________________
> > CWB mailing list
> > CWB at sslmit.unibo.it
> > http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> >   

