[CWB] Open-CWB and Unicode
Alberto Simões
albie at alfarrabio.di.uminho.pt
Tue Dec 11 19:23:09 CET 2007
Thanks for your comments, Stefan.
I'll probably stick to Latin 1 for now :)
Thanks
Stefan Evert wrote:
> Hi!
>
>> Is Open-CWB Unicode aware?
>
> The CWB is not Unicode-aware; in particular, it doesn't fully support
> regular expressions and case/diacritic-insensitive matching for
> Unicode-encoded corpora. The "Open CWB" is identical to the IMS version
> you're familiar with, but hopefully will be extended soon with true
> Unicode support.
>
>> Or should corpora continue to be in Latin 1?
>
> Ideally, corpora should be in Latin 1 to make full use of CQP's query
> capabilities. Other 8-bit ASCII extensions will also work quite well,
> except that you can't use the %c and %d flags or LaTeX escapes for
> accented characters (unfortunately, CQP won't warn you about this and
> will just return nonsensical results, so be careful ...).
>
> If you really need a wider range of characters, it's possible to store
> UTF-8 encoded corpora in the CWB and query them with CQP, if you keep a
> few precautions in mind:
>
> - You have to make sure yourself that the corpus and all queries are
> normalised in the same way: combined (precomposed) characters are
> recommended, but it's most important to be consistent.
> - Of course, case- and diacritic-insensitive search won't work (and the
> same holds for sorting and counting).
> - Don't expect sort order to be sensible for your language: CQP is just
> sorting byte sequences!
> - Only a subset of regular expressions will work properly. In
> particular, "." is not guaranteed to match exactly one character (it may
> match part of a multi-byte character), and character ranges are equally
> useless (except for ASCII characters). You can use regexps for simple
> wildcard searches like "un.+able", though.
>
> Serge Sharoff keeps all his CWB corpora in UTF-8 nowadays, so he should
> have a lot of experience with this.
>
> Best,
> Stefan
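
The UTF-8 caveats in the quoted list can be illustrated outside CQP with a minimal Python sketch. It does not use the CWB or CQP at all: it only assumes that a byte-oriented engine behaves like Python's `re` module applied to UTF-8 bytes, and the helper names (`normalise`, `byte_fullmatch`, `byte_sorted`) are hypothetical, chosen for this illustration.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Return the NFC (precomposed) form of text."""
    return unicodedata.normalize("NFC", text)


def byte_fullmatch(pattern: str, token: str) -> bool:
    """Match a regex against the UTF-8 bytes of token.

    This imitates an engine that sees bytes rather than characters;
    it is an assumption for illustration, not CQP's actual code.
    """
    compiled = re.compile(pattern.encode("utf-8"), re.DOTALL)
    return compiled.fullmatch(token.encode("utf-8")) is not None


def byte_sorted(tokens):
    """Sort tokens by their UTF-8 byte sequences."""
    return sorted(tokens, key=lambda t: t.encode("utf-8"))


# The same visible word as two different code point sequences.
precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

# Without normalisation the two spellings do not compare equal.
mismatch_before = precomposed != decomposed
# After normalising both sides to NFC they do.
match_after = normalise(precomposed) == normalise(decomposed)

# "." consumes one byte, so it covers only half of the two-byte e-acute.
one_dot = byte_fullmatch("caf.", precomposed)
two_dots = byte_fullmatch("caf..", precomposed)

# Simple wildcard patterns over ASCII text are unaffected.
wildcard = byte_fullmatch("un.+able", "unbelievable")

# Byte order places every non-ASCII initial after all ASCII letters.
ordering = byte_sorted(["zebra", "\u00e9cole", "eclair"])
```

Normalising both the corpus text and the query strings with the same function before encoding and querying is one way to meet the consistency requirement in the first bullet.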
--
Alberto Simões - Departamento de Informática - Universidade do Minho
Campus de Gualtar - 4710-057 Braga - Portugal