[CWB] web-interface with aligned corpora and WebCqp::Persistent

Stefan Evert stefan.evert at uos.de
Thu Feb 22 00:08:41 CET 2007


> I've heard Good Things about IBM's ICU
> (http://www-306.ibm.com/software/globalization/icu/index.jsp).

Yes, that's the one I had in mind (since more or less everyone seems  
to be using it anyway).  When we considered a Unicode version of the  
CWB several years ago, I was under the (perhaps mistaken) impression  
that we wouldn't be allowed to use ICU or a similar library with  
closed-source software.

There may also be some technical issues: ICU will bloat the CWB  
binaries considerably (especially if we have to link it statically),  
make it more difficult to compile and distribute the CWB (at the  
moment, it has very few prerequisites beyond GCC, ncurses, bison and  
flex, and compiles rather easily on almost every Unix platform –  
except for Ubuntu), and might make it necessary to ship a huge  
runtime database (I have no idea whether ICU requires Unicode and  
locale database files, but it seems quite likely).  If it weren't for  
this latter issue, I would probably have rewritten the CWB as a Perl  
module by now. :o)

> Apparently, regular expressions etc. are quite well optimized, but  
> there might be a significant speed penalty at program startup,  
> which might be a bit of a bummer for CGI applications (although it  
> should be possible to run cqp as some kind of daemon).

Do you know if there is a substantial startup penalty because of some  
initialisation the ICU libraries have to perform?  Just loading and  
linking large libraries should be fairly fast if they're already  
cached in RAM.  I'm afraid that CQP isn't bullet-proof enough yet to  
allow it to run as a daemon process for a longer period of time  
(though a Web interface could keep a pool of daemons that are  
restarted every few minutes).
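
To make the pool idea a bit more concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration: the class name
RecyclingPool, the max_age parameter, and the one-line-in/one-line-out
protocol are made up, and WORKER_CMD is a trivial stand-in process rather
than a real CQP invocation.  The point is only that a worker is replaced
once it is older than a fixed limit (or has died), so no single process has
to stay up for long.

```python
import subprocess
import sys
import time

# Placeholder worker: echoes each input line in upper case.
# A real setup would start the query processor here instead (not shown).
WORKER_CMD = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n",
]


class RecyclingPool:
    """Fixed-size pool of worker processes that are restarted when too old."""

    def __init__(self, cmd, size=2, max_age=300.0, clock=time.monotonic):
        self.cmd = cmd
        self.max_age = max_age      # seconds a worker may live
        self.clock = clock          # injectable for testing
        self.workers = [self._spawn() for _ in range(size)]
        self.next = 0               # round-robin cursor

    def _spawn(self):
        proc = subprocess.Popen(
            self.cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )
        return {"proc": proc, "born": self.clock()}

    def _retire(self, worker):
        proc = worker["proc"]
        try:
            proc.stdin.close()      # EOF tells the worker to exit
        except OSError:
            pass                    # worker already gone
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        proc.stdout.close()

    def _fresh(self, index):
        """Return the worker at index, replacing it if it is old or dead."""
        worker = self.workers[index]
        too_old = self.clock() - worker["born"] > self.max_age
        dead = worker["proc"].poll() is not None
        if too_old or dead:
            self._retire(worker)
            worker = self.workers[index] = self._spawn()
        return worker

    def query(self, text):
        """Send one line to the next worker and return its one-line reply."""
        index = self.next
        self.next = (self.next + 1) % len(self.workers)
        proc = self._fresh(index)["proc"]
        proc.stdin.write(text + "\n")
        proc.stdin.flush()
        return proc.stdout.readline().rstrip("\n")

    def close(self):
        for worker in self.workers:
            self._retire(worker)
```

A Web front end would create one such pool at startup and call query() per
request; the age check runs just before a worker is used, so an idle worker
is only replaced when it is next needed.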

cu
Stefan



