[CWB] charset question

Wed Oct 3 23:58:35 CEST 2007

Hi everyone!

Sorry about the late reply.

> just to add something I have just stumbled across:
>
> Whenever you use a UTF-8 encoded corpus, NEVER try to set Context  
> to XXX characters. Since in CQP a character is a byte and in UTF-8
> this is not necessarily the case, the query results WILL be in  
> broken UTF-8 because some chars will be sliced in half. Assuming  
> that the tokenizer you used on your corpus doesn't have the same  
> problem, you'll be fine setting Context to XXX words.

Very well observed.  In my opinion, character context should be  
abolished anyway ;-), because it's the main reason why the "cat"  
command is slow and segfaults occasionally.  Things like that are  
much better done in Perl.
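
To make the problem concrete, here is a minimal sketch in Python (not
CQP code; the sample string and the two-byte width are made up for
illustration) of what happens when a context window is counted in
bytes rather than characters:

```python
# Hypothetical example: a context window cut off after a fixed number of BYTES.
text = "Köln über"            # sample string, not taken from any real corpus
data = text.encode("utf-8")   # 'ö' and 'ü' each occupy two bytes in UTF-8


def byte_context(data: bytes, width: int) -> bytes:
    """Return the first `width` bytes, the way a byte-based context would."""
    return data[:width]


cut = byte_context(data, 2)   # 'K' plus only the first byte of 'ö'
try:
    cut.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False          # the two-byte 'ö' was sliced in half
```

The string has 9 characters but 11 bytes, so any byte offset that
lands inside 'ö' or 'ü' yields output that is no longer valid UTF-8.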

> It would be a nice feature to be able to tell the CQP that it's  
> working on UTF-8 and should thus expand the context window to  
> really include the full UTF-8 character. Has that been implemented  
> in some way?

No, it hasn't been implemented yet, and it won't be easy to do,  
because it'll require a complete rewrite of CQP's kwic formatting  
code (so that it knows about the difference between the byte length
and the character length of the string it accumulates).  Terminal
highlighting can also sometimes be problematic, even with non-UTF-8 data.
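
As a rough illustration of the byte length vs. character length
issue, the sketch below (Python, with hypothetical helper names; it is
not how CQP's kwic code works) clips a byte window so that it only
contains whole characters.  It assumes the input is valid UTF-8 and
relies on the fact that UTF-8 continuation bytes always have the bit
pattern 10xxxxxx.  It shrinks the window; expanding it to include the
partial characters instead would work analogously.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def clip_to_char_boundaries(data: bytes, start: int, end: int) -> bytes:
    """Shrink the byte window [start, end) to whole UTF-8 characters.

    Assumes `data` is valid UTF-8.
    """
    start = max(0, start)
    end = min(len(data), end)
    # Move the left edge forward until it sits on the first byte of a character.
    while start < end and is_continuation(data[start]):
        start += 1
    # Move the right edge back if the byte just past it continues a character,
    # i.e. the window would otherwise end in the middle of that character.
    while end > start and end < len(data) and is_continuation(data[end]):
        end -= 1
    return data[start:end]


sample = "Köln über".encode("utf-8")          # 11 bytes, 9 characters
window = clip_to_char_boundaries(sample, 2, 7)
window_text = window.decode("utf-8")          # decodes cleanly
```

Here the requested window starts inside 'ö' and ends inside 'ü', so
both partial characters are dropped and only the whole characters in
between remain.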

I think proper UTF-8 support is one of the top priorities for future
CWB development. It might be useful to make a list of things that  
don't work properly with UTF-8 encoded corpora, perhaps in the CWB wiki?

What we definitely need are (i) a general Unicode library and (ii) a
UTF-8-compatible regular expression library.  Suggestions for available
libraries are very welcome, especially if they come with code
samples :o) or a clear idea of how difficult it is to use the library
and link it to CQP.  I've recently been toying with the idea that we
could use GNOME's GLib instead of a dedicated Unicode implementation
(such as ICU).  As far as I can tell so far, GLib provides all the
basic Unicode functionality we need (for regular expressions, we
could use PCRE or Oniguruma).  It also does most of the platform
configuration for us (figuring out endianness, finding a 32-bit data
type, etc.) and has some other useful utility functions.
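
To show why the regular expression library itself has to understand
UTF-8, here is a small sketch using Python's built-in re module
(chosen only for illustration; it is neither PCRE nor Oniguruma, and
the sample word is made up).  The same pattern is applied once to raw
bytes and once to decoded text:

```python
import re

word = "Köln"                      # sample token, not from any real corpus
raw = word.encode("utf-8")         # 5 bytes: 'ö' takes two of them

# Byte-oriented matching: '.' consumes exactly one BYTE, so it cannot
# cover the two-byte 'ö' and the pattern fails to match the whole token.
byte_match = re.fullmatch(rb"K.ln", raw)

# Character-oriented matching: '.' consumes one CHARACTER.
char_match = re.fullmatch(r"K.ln", word)

# Character classes are affected too: in a bytes pattern, \w only covers
# ASCII letters, digits and underscore, so the bytes of 'ö' are rejected.
byte_word = re.fullmatch(rb"\w+", raw)
char_word = re.fullmatch(r"\w+", word)
```

With byte-oriented matching both patterns fail on this token, whereas
the character-oriented versions match it as a user would expect.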

Best to all of you,
Stefan