[CWB] Does cwb-align-encode support utf8?

Stefan Evert stefanML at collocations.de
Sat Jul 14 10:55:02 CEST 2012


> You'll get no argument from me on that one! But we do have a somewhat established pattern of going to insane lengths in the name of backwards compatibility, e.g. the day or two I spent last year coding up case- and accent-folding tables for the 8859 charsets...

... and that was time well spent, in my opinion, because we can now provide full support for our legacy users who have many (and often very large) corpora consistently encoded in Latin1, Latin2 or similar.

I'm not sure that CQP is the right place to implement fully automatic charset transformation -- I don't think even an expensive commercial RDBMS would provide that level of convenience functionality.  This is especially true if it would only be targeted at the interactive command-line mode.  Web GUIs that use CQP as a backend or access corpora directly through the C-level library or CQi (e.g. the Perl API, the Python API and rcqp) wouldn't benefit from this at all.  Plus, we'd have to start working around all those bugs in iconv() on various platforms that the R developers are constantly complaining about ...

Just my 2 cents ...
Stefan


