[CWB] Open-CWB and Unicode

Alberto Simões albie at alfarrabio.di.uminho.pt
Tue Dec 11 19:23:09 CET 2007


Thanks for your comments, Stefan.
I'll probably stick to Latin 1 for now :)

Thanks

Stefan Evert wrote:
> Hi!
> 
>> Is Open-CWB Unicode aware?
> 
> The CWB is not Unicode-aware; in particular, it doesn't fully support 
> regular expressions and case/diacritic-insensitive matching for 
> Unicode-encoded corpora.  The "Open CWB" is identical to the IMS version 
> you're familiar with, but hopefully will be extended soon with true 
> Unicode support.
> 
>> Or should corpora continue to be in Latin 1?
> 
> Ideally, corpora should be in Latin 1 to make full use of CQP's query 
> capabilities.  Other 8-bit ASCII extensions will also work quite well, 
> except that you can't use the %c and %d flags or LaTeX escapes for 
> accented characters (unfortunately, CQP won't warn you about this and 
> will just return nonsensical results, so be careful ...).
> 
> If you really need a wider range of characters, it's possible to store 
> UTF-8 encoded corpora in the CWB and query them with CQP, if you keep a 
> few precautions in mind:
> 
>  - You have to make sure yourself that the corpus and all queries are 
> normalised: combined characters are recommended, but it's most important 
> to be consistent.
>  - Of course, case- and diacritic-insensitive search won't work (and the 
> same holds for sorting and counting).
>  - Don't expect sort order to be sensible for your language: CQP is just 
> sorting byte sequences!
>  - Only a subset of regular expressions will work properly.  In 
> particular, "." is not guaranteed to match exactly one character (it may 
> match part of a multi-byte character), and character ranges are equally 
> useless (except for ASCII characters).  You can use regexps for simple 
> wildcard searches like "un.+able", though.
> 
> Serge Sharoff keeps all his CWB corpora in UTF-8 nowadays, so he should 
> have a lot of experience with this.
> 
> Best,
> Stefan
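
The precautions Stefan lists can be tried out outside the CWB. What follows is only a rough Python sketch, not CQP code. It assumes that "combined characters" means precomposed forms (Unicode NFC), and it uses Python's re module on UTF-8 bytes as a stand-in for a matcher that works on bytes rather than characters. The helper name normalise is made up for illustration.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Return the precomposed (NFC) form of text (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# The same word typed two ways: precomposed "é" vs. "e" + combining acute.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Without normalisation the two strings differ, so a query in one form
# would miss corpus tokens stored in the other form.
differ_before = precomposed != decomposed
same_after = normalise(precomposed) == normalise(decomposed)

# In UTF-8, "é" is two bytes, so a byte-oriented "." covers only half of it.
encoded = "\u00e9".encode("utf-8")
dot_matches_whole_char = re.fullmatch(rb".", encoded) is not None

# A simple wildcard such as "un.+able" still matches, because ".+" is
# free to consume both bytes of the accented character.
token = "unf\u00e9able".encode("utf-8")
wildcard_matches = re.fullmatch(rb"un.+able", token) is not None

# Sorting by bytes places every non-ASCII initial after all ASCII ones.
words = ["zebra", "\u00e9cole", "apple"]
byte_sorted = sorted(words, key=lambda w: w.encode("utf-8"))

print(differ_before, same_after)
print(dot_matches_whole_char, wildcard_matches)
print(byte_sorted)
```

If the corpus is normalised this way before encoding, the same normalise step would have to be applied to every query string as well, since consistency is what matters here.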

-- 
Alberto Simões - Departamento de Informática - Universidade do Minho
                  Campus de Gualtar - 4710-057 Braga - Portugal

