[CWB] charset question

Stefan Evert stefan.evert at uos.de
Wed Aug 1 11:01:53 CEST 2007


Hi!

Thanks for answering this question, Serge.  Let me just add a few  
comments from an "inside" perspective. :o)

As Serge pointed out, CQP doesn't do any charset conversion for you,  
so any data you send to or receive from CQP has to be in the same  
encoding as the corpus you use.  The best solution is to make sure  
that you're always working with Unicode strings internally (depending  
on what CGI interface you use, you might already get Unicode strings,  
otherwise you will have to decode them from your UTF-8 input), and  
then explicitly encode and decode when you communicate with CQP.   
I.e., before you send a command to CQP, encode the Unicode string  
holding the command to ISO-8859-5 (BTW, the charset code you have to  
use in the registry file is "cyrillic"; the next CWB release will  
also support the "iso-8859-5" alias), and then decode all output you  
read from CQP from ISO-8859-5 to Unicode.  Perl's Encode module does  
an excellent job there, see

   perldoc Encode
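
Here is a minimal sketch of that round-trip with Encode; the input bytes,
the query syntax and the stand-in for a CQP output line are only
placeholders for what your CGI script actually does:

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Placeholder input: in a CGI script this would come from the form data.
    my $utf8_form_input = "\xD1\x81\xD0\xBB\xD0\xBE\xD0\xB2\xD0\xBE";  # UTF-8 bytes

    # 1) Decode the UTF-8 input into a Perl Unicode string.
    my $query_word = decode("utf-8", $utf8_form_input);

    # 2) Before sending a command to CQP, encode it to the corpus charset.
    my $cqp_command = encode("iso-8859-5", qq{A = [word = "$query_word"];\n});

    # 3) Decode everything you read back from CQP the same way.
    my $raw_line   = $cqp_command;                  # stands in for a line read from CQP
    my $as_unicode = decode("iso-8859-5", $raw_line);

    print length($query_word), " characters in the query word\n";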


> Btw: We also have several corpora encoded in iso-8859-2, where I  
> managed
> to get results by applying a (ugly) hard-coded conversion table which
> maps "\x{xxxx}" notation to octal representation. Of course I could do
> that for the iso-8859-5 corpus, too, but I'm looking for a more
> "universal" solution.

Yes, that's definitely an ugly solution. :o) Again, you should use  
the Encode module for such translations, which is not only more  
universal and readily available in the standard library, but should  
also be much faster than your hack because the most widely used  
encoders are written in C.
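
As an illustration, a substitution like the one below lets Encode compute
the target bytes instead of a hand-written table (the \x{xxxx} input format
is only my guess at what your hack operates on):

    use strict;
    use warnings;
    use Encode qw(encode);

    # Turn every \x{xxxx} escape into the corresponding ISO-8859-2 byte(s),
    # letting Encode do the lookup instead of a hard-coded table.
    my $query = 'word = "\x{010D}aj"';    # U+010D = LATIN SMALL LETTER C WITH CARON
    $query =~ s/\\x\{([0-9A-Fa-f]+)\}/encode("iso-8859-2", chr(hex($1)))/ge;

    print $query, "\n";                   # the query now contains ISO-8859-2 bytes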


Future versions of the CWB/Perl interface are planned to provide a  
Unicode frontend, which handles encoding/decoding to the native  
charset of the corpus transparently (that's very easy at that level,  
because you can just push an appropriate encoding layer on the I/O  
streams, long live Perl 5.8! ;-).  The reason I haven't done that yet  
is that a general transition to Unicode would break lots of existing  
scripts and is also less efficient if you work in a single 8-bit  
encoding most of the time; so we will have to provide two alternative  
interfaces.  Well, the Perl modules will need a big overhaul for the  
3.0 release anyway ...
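
For the curious, this is roughly what such an encoding layer looks like in
plain Perl 5.8 (a sketch only, not how the CWB/Perl modules are actually
implemented; it assumes cqp is on your PATH and is started in child mode
with -c):

    use strict;
    use warnings;
    use IPC::Open2 qw(open2);

    # Start CQP as a child process.
    my ($cqp_out, $cqp_in);
    my $pid = open2($cqp_out, $cqp_in, "cqp", "-c");

    # Push an encoding layer onto both handles: from now on you read and
    # write Perl Unicode strings, and the ISO-8859-5 conversion happens
    # transparently.
    binmode($cqp_in,  ":encoding(iso-8859-5)");
    binmode($cqp_out, ":encoding(iso-8859-5)");

    print {$cqp_in} "show corpora;\n";
    my $answer = <$cqp_out>;     # already decoded to a Unicode string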

> no, it doesn't.  You can't use utf-8 in regular expressions, for
> instance, or perform case-insensitive search.  But CWB doesn't harm
> encodings either, so it can work with Chinese and Japanese in exactly
> the same way.

Exactly, CQP treats all strings as byte sequences, so as long as you  
have properly null-terminated strings, it will happily work with  
UTF-8.  Don't try to use %c or %d flags, as this would ruin the  
Unicode characters.  As Serge pointed out, regular expressions don't  
work for Unicode data at the moment, or rather they only work in a  
very limited way.  In particular, "." is not guaranteed to match exactly one  
character, and character classes containing Unicode characters will  
give completely nonsensical results.  Simple alternatives, optional  
elements and prefix/suffix search with .* should work fine, though.
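
If you want to see the byte-level issue for yourself, here is a tiny
illustration (Perl's byte-oriented matching stands in for CQP's regex
engine here):

    use strict;
    use warnings;
    use Encode qw(encode);

    # A single Cyrillic letter is two bytes in UTF-8, so a byte-oriented "."
    # (one byte) cannot cover it, while a prefix search with .* never has to
    # count characters and therefore still works.
    my $char  = "\x{0434}";               # CYRILLIC SMALL LETTER DE
    my $bytes = encode("utf-8", $char);   # the byte sequence the corpus actually stores

    printf "1 character, %d bytes\n", length($bytes);                                  # 2 bytes
    printf "byte-wise /^.\$/ matches:  %s\n", ($bytes =~ /\A.\z/    ? "yes" : "no");   # no
    printf "prefix /^\\xd0.*/ matches: %s\n", ($bytes =~ /\A\xd0.*/ ? "yes" : "no");   # yes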




Best regards,
Stefan Evert

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]



