[CWB] Does cwb-align-encode support utf8?

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Jul 17 13:25:13 CEST 2012


>>>I'm not sure that CQP is the right place to implement fully automatic charset transformation -- I don't think even an expensive commercial RDBMS would provide that level of convenience functionality.  This is especially true if it would only be targeted at the interactive command-line mode.  Web GUIs that use CQP as a backend or access corpora directly through the C-level library or CQi (e.g. the Perl API, the Python API and rcqp) wouldn't profit from this at all.  

No, anything with a CQP backend would benefit. It's true that the CL wouldn't, but with a CL backend the user is supposed to be monitoring and dealing with the ->charset member of the Corpus object themselves.

>>>Plus, we'd have to start working around all those bugs in iconv() on various platforms that the R developers are constantly complaining about ...

Now that's scary! OK, no iconv, just cleansing with "?" to make sure the aligned data does not mismatch the original data in terms of encoding.
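
To make the "cleansing with ?" idea concrete, here is a minimal sketch in Python. It is not CWB code: the function name cleanse and the error-handler name "qmark" are made up for illustration. It assumes "cleansing" means overwriting every byte that is not valid in the corpus charset with a literal "?", with no attempt at conversion.

```python
import codecs


def _qmark(err):
    """Decode-error handler: emit one '?' per undecodable byte."""
    if isinstance(err, UnicodeDecodeError):
        return ("?" * (err.end - err.start), err.end)
    raise err


# "qmark" is a hypothetical handler name chosen for this sketch.
codecs.register_error("qmark", _qmark)


def cleanse(raw: bytes, charset: str = "utf-8") -> bytes:
    """Return raw with every byte that is invalid in charset replaced by b'?'.

    Valid input comes back unchanged; nothing is converted between charsets.
    """
    return raw.decode(charset, errors="qmark").encode(charset)


if __name__ == "__main__":
    # A Latin1-encoded e-acute (0xE9) is not valid UTF-8, so it is replaced.
    print(cleanse(b"caf\xe9 au lait"))
    # Properly encoded UTF-8 passes through untouched.
    print(cleanse("caf\u00e9".encode("utf-8")))
```

Note that this only helps when the target charset can actually reject bytes (UTF-8, ASCII). Python's Latin1 codec accepts every byte value, so a Latin1 corpus would pass through unchanged.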

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 14 July 2012 09:55
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Does cwb-align-encode support utf8?


> You'll get no argument from me on that one! But we do have a somewhat established pattern of going to insane lengths in the name of backwards compatibility, e.g. the day or two I spent last year coding up case- and accent-folding tables for the 8859 charsets...

... and that was time well spent, in my opinion, because we can now provide full support for our legacy users who have many (and often very large) corpora consistently encoded in Latin1, Latin2 or similar.

I'm not sure that CQP is the right place to implement fully automatic charset transformation -- I don't think even an expensive commercial RDBMS would provide that level of convenience functionality.  This is especially true if it would only be targeted at the interactive command-line mode.  Web GUIs that use CQP as a backend or access corpora directly through the C-level library or CQi (e.g. the Perl API, the Python API and rcqp) wouldn't profit from this at all.  Plus, we'd have to start working around all those bugs in iconv() on various platforms that the R developers are constantly complaining about ...

Just my 2 cents ...
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

