[CWB] web-interface with aligned corpora and WebCqp::Persistent
Stefan Evert
stefan.evert at uos.de
Thu Feb 22 00:08:41 CET 2007
> I've heard Good Things about IBM's ICU (http://www-306.ibm.com/
> software/globalization/icu/index.jsp).
Yes, that's the one I had in mind (since more or less everyone seems
to be using it anyway). When we considered a Unicode version of the
CWB several years ago, I was under the (perhaps mistaken) impression
that we wouldn't be allowed to use ICU or a similar library with
closed-source software.
There may also be some technical issues: ICU will bloat the CWB
binaries considerably (especially if we have to link it statically),
make it more difficult to compile and distribute the CWB (at the
moment, it has very few prerequisites beyond GCC, ncurses, bison and
flex, and compiles rather easily on almost every Unix platform –
except for Ubuntu), and might make it necessary to ship a huge
runtime database (I have no idea whether ICU requires Unicode and
locale database files, but it seems quite likely). If it weren't for
this latter issue, I would probably have rewritten the CWB as a Perl
module by now. :o)
> Apparently, regular expressions etc. are quite well optimized, but
> there might be a significant speed penalty at program startup,
> which might be a bit of a bummer for CGI applications (although it
> should be possible to run cqp as some kind of daemon).
Do you know if there is a substantial startup penalty because of some
initialisation the ICU libraries have to perform? Just loading and
linking large libraries should be fairly fast if they're already
cached in RAM. I'm afraid that CQP isn't bullet-proof enough yet to
allow it to run as a daemon process for a longer period of time
(though a Web interface could keep a pool of daemons that are
restarted every few minutes).
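
To make the "pool of daemons that are restarted every few minutes" idea
concrete, here is a minimal sketch in Python. Everything in it is an
assumption for illustration: the class name RecyclingPool, the max_age
parameter, and the ECHO_COMMAND stand-in child (a tiny line-echo process)
are hypothetical and say nothing about how CQP or WebCqp::Persistent
actually behave. A real setup would substitute the command line of the
long-running query process.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a long-running backend: upper-cases each input line.
ECHO_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin: print(line.strip().upper(), flush=True)",
]


class RecyclingPool:
    """Keep `size` child processes alive, replacing any that has exited
    or has been running for at least `max_age` seconds."""

    def __init__(self, command, size=2, max_age=300.0, clock=time.monotonic):
        self.command = command
        self.max_age = max_age
        self.clock = clock  # injectable so the ageing logic can be tested
        self.workers = [self._spawn() for _ in range(size)]

    def _spawn(self):
        proc = subprocess.Popen(
            self.command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        return {"proc": proc, "started": self.clock()}

    def _retire(self, worker):
        proc = worker["proc"]
        try:
            proc.stdin.close()  # EOF on stdin asks the child to finish
        except OSError:
            pass  # child already gone
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # unresponsive child: force it down
            proc.wait()
        proc.stdout.close()

    def refresh(self):
        """Replace dead or over-age workers; return how many were replaced."""
        now = self.clock()
        replaced = 0
        for i, worker in enumerate(self.workers):
            too_old = now - worker["started"] >= self.max_age
            dead = worker["proc"].poll() is not None
            if too_old or dead:
                self._retire(worker)
                self.workers[i] = self._spawn()
                replaced += 1
        return replaced

    def close(self):
        """Shut down all workers."""
        for worker in self.workers:
            self._retire(worker)
        self.workers = []
```

A Web front end would call refresh() before handing a worker to a request,
so that no child lives much longer than max_age, and the cost of starting a
fresh process is paid outside the request where possible.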
cu
Stefan