[CWB] charset question
Stefan Evert
stefan.evert at uos.de
Wed Oct 3 23:58:35 CEST 2007
Hi everyone!
Sorry about the late reply.
> just to add something I have just stumbled across:
>
> Whenever you use a UTF-8 encoded corpus, NEVER try to set Context
> to XXX characters. Since in CQP a character is a byte and in UTF-8
> this is not necessarily the case, the query results WILL be in
> broken UTF-8 because some chars will be sliced in half. Assuming
> that the tokenizer you used on your corpus doesn't have the same
> problem, you'll be fine with setting Context to XXX words.
Very well observed. In my opinion, character context should be
abolished anyway ;-), because it's the main reason why the "cat"
command is slow and segfaults occasionally. Things like that are
much better done in Perl.
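To make the problem concrete, here is a minimal sketch in Python (not CQP code; the helper names byte_context and char_context are made up for this illustration). It shows what happens when a context window is measured in bytes rather than characters on UTF-8 data:

```python
def byte_context(data, n_bytes):
    """Cut a context window by counting bytes (what a byte-based tool does)."""
    return data[:n_bytes]


def char_context(text, n_chars):
    """Cut a context window by counting decoded characters."""
    return text[:n_chars]


text = "Die Größe der Tür"
data = text.encode("utf-8")

# "ö" occupies two bytes in UTF-8 (0xC3 0xB6), at byte offsets 6 and 7.
# A 7-byte window therefore ends in the middle of that character.
cut = byte_context(data, 7)

try:
    cut.decode("utf-8")
except UnicodeDecodeError as err:
    print("broken UTF-8:", err)

# Counting characters on the decoded string keeps "ö" intact.
print(char_context(text, 7))
```

A word-based context avoids the issue for the same reason: token boundaries never fall inside a multi-byte sequence, provided the tokenizer itself was UTF-8 aware.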
> It would be a nice feature to be able to tell the CQP that it's
> working on UTF-8 and should thus expand the context window to
> really include the full UTF-8 character. Has that been implemented
> in some way?
No, it hasn't been implemented yet, and it won't be easy to do,
because it'll require a complete rewrite of CQP's kwic formatting
code (so that it knows about the differences between byte length and
character length of the string it accumulates). Terminal
highlighting can also sometimes be problematic, even with non-UTF data.
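As a rough idea of what "expanding the window to include the full character" could look like, here is a small Python sketch. It is only an illustration under the assumption that the data is valid UTF-8; the function names are hypothetical and this is not how CQP's kwic code is organised. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx:

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def expand_to_char_boundaries(data, start, end):
    """Widen the byte range [start, end) so it never splits a character.

    Assumes `data` is valid UTF-8. The start offset moves left and the end
    offset moves right until both sit on the first byte of a character
    (or at the end of the data).
    """
    start = max(0, min(start, len(data)))
    end = max(start, min(end, len(data)))
    while 0 < start < len(data) and is_continuation(data[start]):
        start -= 1
    while end < len(data) and is_continuation(data[end]):
        end += 1
    return start, end


data = "naïve café".encode("utf-8")

# Offsets 3 and 11 both fall inside two-byte characters ("ï" and "é").
start, end = expand_to_char_boundaries(data, 3, 11)
print(start, end, data[start:end].decode("utf-8"))
```

The boundary check itself is cheap; the hard part in a real implementation is that every place that accumulates or measures the output string has to distinguish byte length from character length.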
I think proper UTF-8 support is one of the top priorities for future
CWB development. It might be useful to make a list of things that
don't work properly with UTF-8 encoded corpora, perhaps in the CWB wiki?
What we definitely need are (i) a general Unicode library and (ii) a
UTF-compatible regular expression library. Suggestions for available
libraries are very welcome, especially if they come with code
samples :o) or a clear idea how difficult it is to use this library
and link it to CQP. I've recently been toying with the idea that we
could use GNOME's GLib instead of a dedicated Unicode implementation
(such as ICU). As far as I can tell so far, GLib provides all the
basic Unicode functionality we need (for regular expressions, we
could use PCRE or Oniguruma). It also does most of the platform
configuration for us (figuring out endianness, finding a 32-bit data
type, etc.) and has some other useful utility functions.
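To show why the regular expression library has to be UTF-8 aware, here is a short sketch using Python's standard re module as a stand-in (it is not PCRE or Oniguruma, and the example word is arbitrary). The same pattern behaves differently depending on whether the engine sees bytes or characters:

```python
import re

word = "Tür"
raw = word.encode("utf-8")  # 4 bytes: "T", 0xC3, 0xBC, "r"

# Byte-oriented matching: "." consumes exactly one byte, so "ü" (two bytes)
# does not fit the single wildcard and the pattern fails.
byte_match = re.fullmatch(rb"T.r", raw)

# The byte-oriented engine only matches if the user writes two wildcards,
# i.e. if the query leaks the encoding.
byte_match_two = re.fullmatch(rb"T..r", raw)

# Character-oriented matching: "." consumes one character, as a user expects.
char_match = re.fullmatch(r"T.r", word)

print(byte_match is None, byte_match_two is not None, char_match is not None)
```

The same mismatch affects character classes and case-insensitive matching, which is why a plain byte-based regex engine is not enough for UTF-8 encoded corpora.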
Best to all of you,
Stefan