[CWB] Open-CWB and Unicode

Tue Dec 11 17:19:41 CET 2007

Hi,

as mentioned by Stefan, the utf8 handling in CWB is limited, but it's
possible to use it, and my corpora (Western and Eastern European
languages, Russian, Chinese and Japanese) all peacefully co-exist in
utf8, see http://corpus.leeds.ac.uk/internet.html  

Another problem worth mentioning is the context size setting.  If
it's set in bytes, the context can be cut in the middle of a
multi-byte character, so you'll get broken utf8 chars at the edges.
Use sentences or words instead (set context 10 words).
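
To illustrate the edge problem outside of CQP, here is a minimal Python
sketch.  It is not CQP code: the sample string and the helper name
clip_bytes are made up for illustration, and it only assumes that the
context is cut at a fixed byte offset.

```python
def clip_bytes(text: str, n: int) -> bytes:
    """Return the first n bytes of the UTF-8 encoding of text (hypothetical helper)."""
    return text.encode("utf-8")[:n]


# "naive cafe" with i-diaeresis and e-acute; each accented letter is two bytes in UTF-8.
sample = "na\u00efve caf\u00e9"

# Cutting after 3 bytes lands in the middle of the two-byte i-diaeresis.
clipped = clip_bytes(sample, 3)

# Decoding the clipped bytes leaves a replacement character at the edge.
broken = clipped.decode("utf-8", errors="replace")

# Cutting by words instead keeps every character intact.
by_words = " ".join(sample.split()[:1])

print(repr(clipped), repr(broken), repr(by_words))
```

A word- or sentence-based context never ends inside a character, which
is why it avoids the problem.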

Best wishes,
Serge

On Tue, 2007-12-11 at 16:53 +0100, Stefan Evert wrote:
> Hi!
> 
> > Is Open-CWB Unicode aware?
> 
> The CWB is not Unicode-aware; in particular, it doesn't fully support  
> regular expressions and case/diacritic-insensitive matching for  
> Unicode-encoded corpora.  The "Open CWB" is identical to the IMS  
> version you're familiar with, but hopefully will be extended soon  
> with true Unicode support.
> 
> > Or should corpora continue to be in Latin 1?
> 
> Ideally, corpora should be in Latin 1 to make full use of CQP's query
> capabilities.  Other 8-bit ASCII extensions will also work quite
> well, except that you can't use the %c and %d flags and LaTeX escapes
> for accented characters (unfortunately, CQP won't warn you about this
> and will just return nonsensical results, so be careful ...).
> 
> If you really need a wider range of characters, it's possible to  
> store UTF-8 encoded corpora in the CWB and query them with CQP, if  
> you keep a few precautions in mind:
> 
>   - You have to make sure yourself that the corpus and all queries
> are normalised: precomposed (combined) characters are recommended,
> but it's most important to be consistent.
>   - Of course, case- and diacritic-insensitive search won't work (and  
> the same holds for sorting and counting).
>   - Don't expect sort order to be sensible for your language: CQP is  
> just sorting byte sequences!
>   - Only a subset of regular expressions will work properly.  In  
> particular, "." is not guaranteed to match exactly one character (it  
> may match part of a multi-byte character), and character ranges are  
> equally useless (except for ASCII characters).  You can use regexps
> for simple wildcard searches like "un.+able", though.
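
The precautions in the list above can be tried out in a small Python
sketch.  It rests on two assumptions: that "normalised" means a Unicode
normalisation form such as NFC, and that Python's byte-level re module
is a fair stand-in for a regex engine that sees only bytes.  It does
not call CQP, and the sample words are made up.

```python
import re
import unicodedata

composed = "caf\u00e9"      # e-acute as one precomposed code point
decomposed = "cafe\u0301"   # plain 'e' followed by a combining acute accent

# Both render the same on screen, but their byte sequences differ,
# so a byte-oriented engine treats them as different strings.
same_raw = composed.encode("utf-8") == decomposed.encode("utf-8")


def nfc(s: str) -> str:
    """Normalise to NFC (precomposed form) before indexing or querying."""
    return unicodedata.normalize("NFC", s)


same_nfc = nfc(composed) == nfc(decomposed)

# Byte-level matching: "." consumes one byte, not one character.
word = "\u00fcber".encode("utf-8")          # u-umlaut is two bytes, 5 bytes in total
one_dot = re.fullmatch(rb".ber", word)      # fails: "." covers half of the u-umlaut
two_dots = re.fullmatch(rb"..ber", word)    # matches, but only by counting bytes
wildcard = re.fullmatch(rb".+ber", word)    # simple wildcards still work

# Sorting byte sequences puts accented initials after 'z'.
words = ["zebra", "\u00e9cole", "apple"]
byte_sorted = sorted(words, key=lambda s: s.encode("utf-8"))

print(same_raw, same_nfc, one_dot, bool(two_dots), bool(wildcard), byte_sorted)
```

Running the same normalisation over both the corpus and the query
strings is what keeps the two sides consistent.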
> 
> Serge Sharoff keeps all his CWB corpora in UTF-8 nowadays, so he  
> should have a lot of experience with this.
> 
> Best,
> Stefan