[CWB] Linking CWB and R

Stefan Evert stefanML at collocations.de
Thu Nov 24 16:13:46 CET 2011


> There is no reason why you couldn't create R functions that call CWB libraries. The existing CL (corpus library) is designed for exactly such undertakings. In the long term, we hope to separate out other functionality like the CQP query syntax into libraries that could be accessed in the same way. There's a lot of work to get to that point, however!

And we don't know when we'll get around to making such fundamental changes, so it may be a better idea -- at least for the time being -- to implement a CQi client that communicates with CQPserver.

Is that what you've done in your implementation, Sylvain? Or did you write your own client-server protocol? 

I've been reluctant about CQi recently, since it was quickly cobbled together as an ad-hoc solution and has never been revised properly; I'm also not using it in my own research because the CWB/Perl interface is faster and more flexible. However, there does seem to be increasing interest, especially for using CWB from Java, and some people I talked to seemed to be quite happy with the current state of CQi.

I'd very much like a CQi client for R, preferably with a few higher-level wrappers so you don't always have to execute low-level CQi calls.  The biggest hurdle, I guess, is that the code for encoding and decoding the byte stream protocol should be written in C if we want to achieve reasonable speed.
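To give a rough idea of what that encoding and decoding layer involves, here is a minimal C sketch. It assumes the wire format uses network (big-endian) byte order, with 16-bit words, 32-bit integers, and strings sent as a 16-bit length followed by the raw bytes; please check this against the CQi specification before relying on it. The function names (put_u16, put_i32, get_u16, get_i32, put_string, get_string) are made up for illustration and are not part of CWB or CQi.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: names and layout are assumptions, not CWB/CQi API. */

/* Write a 16-bit unsigned value in big-endian order; returns bytes written. */
size_t put_u16(uint8_t *buf, uint16_t v) {
    buf[0] = (uint8_t)(v >> 8);
    buf[1] = (uint8_t)(v & 0xFF);
    return 2;
}

/* Write a 32-bit signed value in big-endian order; returns bytes written. */
size_t put_i32(uint8_t *buf, int32_t v) {
    uint32_t u = (uint32_t)v;
    buf[0] = (uint8_t)(u >> 24);
    buf[1] = (uint8_t)((u >> 16) & 0xFF);
    buf[2] = (uint8_t)((u >> 8) & 0xFF);
    buf[3] = (uint8_t)(u & 0xFF);
    return 4;
}

/* Read a big-endian 16-bit unsigned value. */
uint16_t get_u16(const uint8_t *buf) {
    return (uint16_t)(((uint16_t)buf[0] << 8) | buf[1]);
}

/* Read a big-endian 32-bit signed value (assumes two's complement). */
int32_t get_i32(const uint8_t *buf) {
    uint32_t u = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                 ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    return (int32_t)u;
}

/* Write a length-prefixed string (16-bit length, then bytes, no terminator).
   Returns bytes written, or 0 if the string does not fit. */
size_t put_string(uint8_t *buf, size_t cap, const char *s) {
    size_t len = strlen(s);
    if (len > UINT16_MAX || len + 2 > cap) return 0;
    put_u16(buf, (uint16_t)len);
    memcpy(buf + 2, s, len);
    return len + 2;
}

/* Read a length-prefixed string into out and NUL-terminate it.
   Returns bytes consumed, or 0 if the input is short or out is too small. */
size_t get_string(const uint8_t *buf, size_t avail, char *out, size_t out_cap) {
    if (avail < 2) return 0;
    size_t len = get_u16(buf);
    if (len + 2 > avail || len + 1 > out_cap) return 0;
    memcpy(out, buf + 2, len);
    out[len] = '\0';
    return len + 2;
}
```

In an R package, routines like these would sit behind a C entry point, so that R code only ever sees complete vectors rather than individual bytes.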

> I would be very interested, before going further, in your comments and opinions about another project: linking CWB and R through calls to a CWB C library. A CWB library could be linked to an R module (and automatically installed with this module). Rather than being communicated one by one via a socket, the vector elements produced by CWB would be represented in C, with a light extension to the existing code, and wrapped in an R vector. Such an R vector would simply give access to the original data, without copying any data structure.

I'm not an expert on R hacking, but I don't think you're allowed to do this.  R manages its own memory, and when you create an R vector from a list of integers or strings returned by CWB, R will make a copy of the vector.  Anything that gives direct access to internal CWB data would probably require very advanced R hacking skills.

So, I'm in favour of a CQi client for R, and I'd be happy to help with it if I can spare a little bit of time.

Best wishes,
Stefan


