[CWB] charset question
Stefan Evert
stefan.evert at uos.de
Wed Aug 1 11:01:53 CEST 2007
Hi!
Thanks for answering this question, Serge. Let me just add a few
comments from an "inside" perspective. :o)
As Serge pointed out, CQP doesn't do any charset conversion for you,
so any data you send to or receive from CQP has to be in the same
encoding as the corpus you use. The best solution is to make sure
that you're always working on Unicode strings internally (depending
on what CGI interface you use, you might already get Unicode strings,
otherwise you will have to decode them from your UTF-8 input), and
then explicitly encode and decode when you communicate with CQP.
I.e., before you send a command to CQP, encode the Unicode string
holding the command to ISO-8859-5, and then decode all output you
read from CQP from ISO-8859-5 to Unicode. (BTW, the charset code you
have to use in the registry file is "cyrillic"; the next CWB release
will also support the "iso-8859-5" alias.) Perl's Encode module does
an excellent job there, see
perldoc Encode
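A minimal sketch of that encode/decode step, assuming the corpus is in
ISO-8859-5. The helper names to_cqp and from_cqp are made up for this
example and are not part of any CWB module; the actual communication
with the CQP process is left out.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode's name for the corpus charset (assumption: an ISO-8859-5 corpus).
my $CORPUS_CHARSET = "iso-8859-5";

# Hypothetical helper: Unicode string -> bytes in the corpus encoding,
# ready to be written to CQP.
sub to_cqp {
    my ($unicode) = @_;
    return encode($CORPUS_CHARSET, $unicode);
}

# Hypothetical helper: bytes read from CQP -> Unicode string.
sub from_cqp {
    my ($bytes) = @_;
    return decode($CORPUS_CHARSET, $bytes);
}

# A query for the Cyrillic word U+0434 U+0430, written with escapes so
# that this script itself stays plain ASCII.
my $query = "[word = \"\x{434}\x{430}\"];";

my $bytes = to_cqp($query);     # one byte per character in ISO-8859-5
my $back  = from_cqp($bytes);   # the original Unicode string again
```

Everything between these two calls is plain bytes in the corpus
encoding; everything outside them is Unicode.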
> Btw: We also have several corpora encoded in iso-8859-2, where I
> managed to get results by applying an (ugly) hard-coded conversion
> table which
> maps "\x{xxxx}" notation to octal representation. Of course I could do
> that for the iso-8859-5 corpus, too, but I'm looking for a more
> "universal" solution.
Yes, that's definitely an ugly solution. :o) Again, you should use
the Encode module for such translations, which is not only more
universal and readily available in the standard library, but should
also be much faster than your hack because the most widely used
encoders are written in C.
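As a rough illustration of what replaces the conversion table, here is
a sketch for a single ISO-8859-2 byte. The sample byte and the variable
names are chosen for this example only.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Byte 0xBE is "z with caron" (U+017E) in ISO-8859-2.
my $latin2 = "\xBE";

# Corpus bytes -> Unicode string.
my $unicode = decode("iso-8859-2", $latin2);

# Unicode string -> UTF-8 bytes, e.g. for a web page.
my $utf8 = encode("UTF-8", $unicode);

# from_to() converts a byte string in place from one encoding to another.
my $octets = "\xBE";
from_to($octets, "iso-8859-2", "utf-8");
```

The same three calls work for ISO-8859-5 or any other charset that
Encode knows; only the encoding name changes.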
Future versions of the CWB/Perl interface are planned to provide a
Unicode frontend, which handles encoding/decoding to the native
charset of the corpus transparently (that's very easy at that level,
because you can just push an appropriate encoding layer on the I/O
streams, long live Perl 5.8! ;-). The reason I haven't done that yet
is that a general transition to Unicode would break lots of existing
scripts and is also less efficient if you work in a single 8-bit
encoding most of the time; so we will have to provide two alternative
interfaces. Well, the Perl modules will need a big overhaul for the
3.0 release anyway ...
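To give an idea of what such an encoding layer looks like, here is a
sketch that uses in-memory filehandles in place of the pipes to and
from CQP (an assumption made only so that the example is
self-contained). For a real pipe you would call
binmode($fh, ":encoding(iso-8859-5)") on the handle instead.

```perl
use strict;
use warnings;

# Stand-in for the stream going to CQP: an in-memory file.
my $buffer = "";

# Everything printed to $out is encoded to ISO-8859-5 by the I/O layer.
open(my $out, ">:encoding(iso-8859-5)", \$buffer)
    or die "cannot open output: $!";
print $out "\x{434}\x{430}\n";   # Unicode string, two Cyrillic letters
close($out);
# $buffer now holds the three bytes D4 D0 0A.

# Everything read from $in is decoded from ISO-8859-5 to Unicode.
open(my $in, "<:encoding(iso-8859-5)", \$buffer)
    or die "cannot open input: $!";
my $line = <$in>;
close($in);
```

The script itself never calls encode or decode; the layer on the
filehandle does the conversion in both directions.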
> no, it doesn't. You can't use utf-8 in regular expressions, for
> instance, or perform case-insensitive search. But CWB doesn't harm
> encodings either, so it can work with Chinese and Japanese in exactly
> the same way.
Exactly, CQP treats all strings as byte sequences, so as long as you
have properly null-terminated strings, it will happily work with
UTF-8. Don't try to use the %c (ignore case) or %d (ignore diacritics)
flags, as they would corrupt the multi-byte Unicode characters. As
Serge pointed out, regular expressions don't work for Unicode data at
the moment, or rather, they work only in a very limited way. In
particular, "." is not guaranteed to match exactly one character, and
character classes containing Unicode characters will give completely
nonsensical results. Simple alternatives, optional elements and
prefix/suffix search with .* should work fine, though.
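The effect can be reproduced in Perl by matching against raw UTF-8
bytes instead of a decoded string. This is only an analogy to what
happens inside CQP, not CQP code, and the sample word is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two Cyrillic letters (U+0434 U+0430) as a Unicode string ...
my $chars = "\x{434}\x{430}";
# ... and as UTF-8 bytes: D0 B4 D0 B0.
my $bytes = encode("UTF-8", $chars);

# On characters, two dots match the whole word.
my $two_dots_on_chars = ($chars =~ /^..$/) ? 1 : 0;     # 1

# On bytes, the same pattern fails; four dots are needed.
my $two_dots_on_bytes  = ($bytes =~ /^..$/)   ? 1 : 0;  # 0
my $four_dots_on_bytes = ($bytes =~ /^....$/) ? 1 : 0;  # 1

# A byte-level class built from the bytes of U+0434 (D0 B4) also
# "matches" the first byte of the different letter U+0430 (D0 B0).
my $other_letter = encode("UTF-8", "\x{430}");
my $class_matches_wrong_letter =
    ($other_letter =~ /^[\xD0\xB4]/) ? 1 : 0;           # 1
```

A literal prefix followed by .* is safe because the bytes of the
prefix are compared one by one, and .* simply swallows the rest.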
Best regards,
Stefan Evert
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]