[CWB] Open-CWB and Unicode
Stefan Evert
stefan.evert at uos.de
Tue Dec 11 16:53:49 CET 2007
Hi!
> Is Open-CWB Unicode aware?
The CWB is not Unicode-aware; in particular, it doesn't fully support
regular expressions and case/diacritic-insensitive matching for
Unicode-encoded corpora. "Open CWB" is identical to the IMS
version you're familiar with, but will hopefully be extended soon
with true Unicode support.
> Or should corpora continue to be in Latin 1?
Ideally, corpora should be in Latin 1 to make full use of CQP's query
capabilities. Other 8-bit ASCII extensions will also work quite
well, except that you can't use the %c and %d flags or the LaTeX
escapes for accented characters (unfortunately, CQP won't warn you
about this and will just return nonsensical results, so be careful ...).
If you really need a wider range of characters, it's possible to
store UTF-8 encoded corpora in the CWB and query them with CQP, if
you keep a few precautions in mind:
- You have to make sure yourself that the corpus and all queries
are normalised in the same way: precomposed ("combined") characters
are recommended, but it's most important to be consistent.
- Of course, case- and diacritic-insensitive search won't work (and
the same holds for sorting and counting).
- Don't expect sort order to be sensible for your language: CQP is
just sorting byte sequences!
- Only a subset of regular expressions will work properly. In
particular, "." is not guaranteed to match exactly one character (it
may match part of a multi-byte character), and character ranges are
equally useless (except for ASCII characters). You can use regexps
for simple wildcard searches like "un.+able", though.
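To illustrate the precautions above, here is a minimal Python sketch. It does not use the CWB or CQP at all: it only imitates byte-oriented matching and sorting by running Python's re module and sorted() on UTF-8 byte strings, on the assumption that this is a fair stand-in for the behaviour described. The helper names normalise, byte_match and byte_sorted are made up for this example.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Return the NFC (precomposed) form of text.

    Apply the same normalisation to the corpus and to every query.
    """
    return unicodedata.normalize("NFC", text)


def byte_match(pattern: str, word: str) -> bool:
    """Match pattern against word byte by byte, not character by character.

    Assumption: this only imitates a byte-oriented regexp engine;
    it is not CQP's actual matcher.
    """
    return re.fullmatch(pattern.encode("utf-8"), word.encode("utf-8")) is not None


def byte_sorted(words: list[str]) -> list[str]:
    """Sort words by their UTF-8 byte sequences."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


if __name__ == "__main__":
    # 'e' followed by a combining acute accent vs. the precomposed character
    decomposed = "cafe\u0301"
    precomposed = "caf\u00e9"
    print(decomposed == precomposed)             # False
    print(normalise(decomposed) == precomposed)  # True

    # "ü" is two bytes in UTF-8, so "." covers only half of it
    print(byte_match("f.r", "für"))    # False
    print(byte_match("f..r", "für"))   # True
    print(byte_match("f.+r", "für"))   # True

    # Upper case sorts before lower case, and "ä" sorts after "z"
    print(byte_sorted(["zebra", "Zebra", "äpfel", "apfel"]))
    # ['Zebra', 'apfel', 'zebra', 'äpfel']
```

The "f.+r" case shows why simple wildcard searches still behave as expected: ".+" does not care how many bytes a character occupies.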
Serge Sharoff keeps all his CWB corpora in UTF-8 nowadays, so he
should have a lot of experience with this.
Best,
Stefan