[CWB] Linking CWB and R

Sylvain Loiseau sylvain.loiseau at wanadoo.fr
Thu Nov 24 17:01:44 CET 2011


On 24 Nov 2011, at 16:13, Stefan Evert wrote:

> 
>> There is no reason why you couldn't create R functions that call CWB libraries. The existing CL (corpus library) is designed for exactly such undertakings. In the long term, we hope to separate out other functionality like the CQP query syntax into libraries that could be accessed in the same way. There's a lot of work to get to that point, however!
> 
> And we don't know when we'll get around to making such fundamental changes, so it may be a better idea -- at least for the time being -- to implement a CQi client that communicates with CQPserver.
> 
> Is that what you've done in your implementation, Sylvain? Or did you write your own client-server protocol? 

I wrote an R implementation that simply mimics the Perl implementation.

It's available here:

    https://r-forge.r-project.org/projects/rcwb/

(at the bottom of the page, "SCM repository".)

But I'm afraid it is full of bugs and not very clean or efficient.

If you source these three files:

> source("client.R")
> source("constantes.R")
> source("server.R")

you can interact with CQPserver using the CQi protocol:

> con <- get_cwb(server_options="-r /path/to/your/registry");
> cqi_attributes("YOUR_CORPUS", "p", con) # ask for the positional ("p") attributes, returned as a character vector
[1] "word"  "pos"   "func"  "lemma" "id"   

The cqpserver is launched by the first command.

The other files in the same directory try to define a higher-level set of CWB objects (corpus, attribute...), but I don't think the result is satisfactory so far. The file test.R shows how these objects are used.

> I've been reluctant about CQi recently since it was quickly cobbled together as an ad-hoc solution and has never been revised properly; and I'm not using it in my own research because the CWB/Perl interface is faster and more flexible. However, there does seem to be increasing interest, especially for using CWB from Java, and some people I talked to seemed to be quite happy with the current state of the CQi.
> 
> I'd very much like a CQi client for R, preferably with a few higher-level wrappers so you don't always have to execute low-level CQi calls.
> The biggest hurdle, I guess, is that the code for encoding and decoding the byte stream protocol should be written in C if we want to achieve reasonable speed.
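
As a rough illustration of the kind of C code this would involve, here is a minimal sketch of decoding a list of integers from a byte stream. It assumes a wire format in which integers are 4 bytes in network (big-endian) byte order and a list is a 4-byte count followed by its elements; the function names are hypothetical and are not taken from CWB or from the R client above.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte big-endian (network order) integer from buf. */
int32_t read_int32_be(const unsigned char *buf)
{
    uint32_t v = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                 ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    return (int32_t)v;
}

/* Write a 4-byte big-endian integer into buf. */
void write_int32_be(unsigned char *buf, int32_t value)
{
    uint32_t v = (uint32_t)value;
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Decode an integer list (assumed layout: 4-byte count, then count
 * 4-byte integers) from buf into out.
 * Returns the number of integers decoded, or -1 if the buffer is too
 * short, the count is negative, or out cannot hold max_out elements. */
long decode_int_list(const unsigned char *buf, size_t len,
                     int32_t *out, size_t max_out)
{
    if (len < 4)
        return -1;
    int32_t n = read_int32_be(buf);
    if (n < 0 || (size_t)n > max_out || (len - 4) / 4 < (size_t)n)
        return -1;
    for (int32_t i = 0; i < n; i++)
        out[i] = read_int32_be(buf + 4 + 4 * (size_t)i);
    return n;
}
```

The decoding loop itself is simple; any speed gain over pure R would come from doing it in one compiled pass instead of one R-level call per element.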

If a layer of glue code in C has to be added anyway, don't you think this effort would be better invested in linking the two directly in some way, without the stream protocol?

>> I would be very interested, before going further, in your comments and opinions about another project: linking CWB and R through calls to a CWB C library. A CWB library could be linked to an R module (and automatically installed with this module). Rather than being communicated one by one via a socket, the vector elements produced by CWB would be represented in C, with a light extension to the existing code, and wrapped in an R vector. Such an R vector would simply give access to the original data, without copying any data structure.
> 
> I'm not an expert on R hacking, but I don't think you're allowed to do this.  R manages its own memory, and when you create an R vector from a list of integers or strings returned by CWB, R will make a copy of the vector.  Anything that gives direct access to internal CWB data would probably require very advanced R hacking skills.

If the CWB code can be packaged, extended with some linking code, and the whole thing compiled into a library, it may not be that hard (?).

A C array in a library can be wrapped so that it is seen as an R vector by R code using this library. This is what I infer from: http://cran.r-project.org/doc/manuals/R-exts.html#Interface-functions-_002eC-and-_002eFortran
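
To make the idea concrete, here is a minimal sketch of a C function written in the style the .C interface expects: every argument is a pointer, the function returns void, and results are written back through the arguments. The array `demo_ids` and the function `demo_fetch_ids` are hypothetical stand-ins for data held by a CWB library; they are not part of CWB.

```c
#include <stddef.h>

/* Hypothetical stand-in for an array owned by the C library. */
static const int demo_ids[] = {10, 20, 30, 40, 50};
static const int demo_len = 5;

/* .C-style entry point (all arguments are pointers, no return value).
 *   start: 0-based index of the first element to fetch
 *   n:     on entry, capacity of out; on exit, number of values written
 *   out:   buffer supplied by the caller (an R integer vector under .C) */
void demo_fetch_ids(int *start, int *n, int *out)
{
    int written = 0;
    for (int i = *start; i >= 0 && i < demo_len && written < *n; i++)
        out[written++] = demo_ids[i];
    *n = written;
}
```

Once compiled into a shared library and loaded with dyn.load(), this could be called from R as .C("demo_fetch_ids", start = as.integer(0), n = as.integer(5), out = integer(5)). Note that in this sketch the values are written into a vector allocated by R, so the data is copied rather than shared, which is consistent with the objection quoted above.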

Best,
Sylvain

