[CWB] Open-CWB and Unicode

Tue Dec 11 16:53:49 CET 2007

Hi!

> Is Open-CWB Unicode aware?

The CWB is not Unicode-aware; in particular, it doesn't fully support  
regular expressions and case/diacritic-insensitive matching for  
Unicode-encoded corpora.  The "Open CWB" is identical to the IMS  
version you're familiar with, but hopefully will be extended soon  
with true Unicode support.

> Or should corpora continue to be in Latin 1?

Ideally, corpora should be in Latin 1 to make full use of CQP's query
capabilities.  Other 8-bit ASCII extensions will also work quite
well, except that you can't use the %c and %d flags or the LaTeX escapes
for accented characters (unfortunately, CQP won't warn you about this
and will just return nonsensical results, so be careful ...).
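
To see why the results become nonsensical, here is a minimal Python sketch.
It is not CWB code: it simply assumes that case folding is done with a
Latin 1 table applied to raw bytes, and the helper name latin1_lower is
made up for this illustration.  KOI8-R is just one example of another
8-bit encoding.

```python
def latin1_lower(data: bytes) -> bytes:
    """Lowercase raw bytes as if they were Latin 1 text (assumed folding rule)."""
    return data.decode("latin-1").lower().encode("latin-1")


# With real Latin 1 data the folding does what you expect.
latin1_word = "\u00c9COLE".encode("latin-1")          # "ECOLE" with E-acute
latin1_folded = latin1_lower(latin1_word).decode("latin-1")
print(latin1_folded == "\u00e9cole")                  # True

# A Russian word that is already all lower case, stored as KOI8-R bytes.
koi8_text = "\u043f\u0440\u0438\u0432\u0435\u0442"
koi8_word = koi8_text.encode("koi8-r")

# The same byte-level folding now rewrites letters it should leave alone.
koi8_folded = latin1_lower(koi8_word).decode("koi8-r")
print(koi8_folded == koi8_text)                       # False
```

The folding silently changes the KOI8-R word although it contained no
upper-case letters, which is the kind of quiet mismatch to watch out for.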

If you really need a wider range of characters, it's possible to  
store UTF-8 encoded corpora in the CWB and query them with CQP, if  
you keep a few precautions in mind:

  - You have to make sure yourself that the corpus and all queries
are normalised: precomposed (combined) characters are recommended, but
it's most important to be consistent.
  - Of course, case- and diacritic-insensitive search won't work (and  
the same holds for sorting and counting).
  - Don't expect sort order to be sensible for your language: CQP is  
just sorting byte sequences!
  - Only a subset of regular expressions will work properly.  In
particular, "." is not guaranteed to match exactly one character (it
may match part of a multi-byte character), and character ranges are
equally useless (except for ASCII characters).  You can use regexps
for simple wildcard searches like "un.+able", though.
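
The precautions above can be illustrated with a minimal Python sketch.
It is not CWB code: Python's re module applied to raw UTF-8 bytes stands
in for a byte-oriented regexp engine, and the helper names normalise and
byte_match are made up for this example.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Return text in NFC, i.e. with precomposed characters where they exist."""
    return unicodedata.normalize("NFC", text)


def byte_match(pattern: str, word: str) -> bool:
    """Match the whole word byte by byte, as a non-Unicode-aware engine would."""
    return re.fullmatch(pattern.encode("utf-8"), word.encode("utf-8")) is not None


composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

# The two spellings look identical but are different byte sequences,
# so a query in one form will not find corpus text in the other.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
print(normalise(decomposed) == composed)                        # True

# "." consumes one byte, but e-acute occupies two bytes in UTF-8.
print(byte_match("caf.", composed))    # False
print(byte_match("caf..", composed))   # True

# Simple wildcard searches still behave as expected.
print(byte_match("un.+able", "unbelievable"))  # True

# Sorting compares bytes, so every non-ASCII initial sorts after "z".
words = ["zebra", "\u00e9cole", "eagle"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_bytes)
```

Running the same normalisation over both the corpus text (before encoding
it) and every query string is one way to stay consistent.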

Serge Sharoff keeps all his CWB corpora in UTF-8 nowadays, so he  
should have a lot of experience with this.

Best,
Stefan