[CWB] Suggestion: user intervention in constructing an index

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Mar 26 01:53:35 CEST 2018


An extra word on "dynamic attributes". In regard to,

>>> I don't know if CWB supports something similar,

Let me fill in the background...

D-attributes were part of the original versions of CWB back in the early 1990s. However, instead of being functions in a shared library as in Manatee, they were defined as pipelines to an external process, such as a Perl script. This results in much slower evaluation than the Manatee approach, but doesn't require compilation of a shared library to set up. 
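
To make that contrast concrete, here is a minimal Python sketch of the two styles. It is not CWB or Manatee code: the function names and the upper-casing conversion are invented for illustration, and a small Python one-liner stands in for the external script.

```python
import subprocess
import sys

# Hypothetical stand-in for an external conversion script (e.g. a Perl script):
# it reads one value from stdin and writes the converted value to stdout.
EXTERNAL_SCRIPT = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def convert_via_pipeline(value: str) -> str:
    """Pipeline style: start an external process to do the conversion."""
    result = subprocess.run(
        [sys.executable, "-c", EXTERNAL_SCRIPT],
        input=value,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def convert_in_process(value: str) -> str:
    """Library style: the conversion is an ordinary function call."""
    return value.upper()


if __name__ == "__main__":
    for token in ["sean", "bhean"]:
        # Both styles give the same answer; only the cost per call differs.
        assert convert_via_pipeline(token) == convert_in_process(token)
        print(token, "->", convert_in_process(token))
```

The pipeline version pays for process start-up and inter-process I/O on every call, which is the kind of overhead that makes this style slower, while the in-process version has to be built into (or loaded by) the program that calls it.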

The original implementation is described in the very old manual by Oli Christ, available in our doc archive here: http://cwb.sourceforge.net/files/Christ1994_TR.pdf (see section 4.2.7).

However, d-attributes were *taken out* about 15 years ago IIRC - before my time, but I seem to recall being told that was because of security concerns (vis-à-vis inserting user input strings into a shell call). Stefan knows the details of the removal because he was the one who did it!
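
As a generic illustration of that class of problem (not a reconstruction of the removed CWB code), the Python sketch below contrasts interpolating a user-supplied string into a shell command line with passing it as a separate argument. It assumes a POSIX system where `echo` is available; the function names are hypothetical.

```python
import subprocess


def call_unsafely(value: str) -> str:
    """Interpolate the value into a shell command line (the risky pattern)."""
    result = subprocess.run(
        f"echo {value}", shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


def call_safely(value: str) -> str:
    """Pass the value as a separate argument, so no shell ever parses it."""
    result = subprocess.run(
        ["echo", value], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    hostile = "sean; echo INJECTED"
    print(call_unsafely(hostile))  # the shell runs a second command
    print(call_safely(hostile))    # the whole string is one literal argument
```

In the first function the shell treats the `;` as a command separator, so anything after it in the user's input is executed as a command of its own.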

Some of the code for dynamic attributes is still present, e.g. the DynCallResult object, and the function cl_dynamic_call()  - but this is because the same code supports the "builtin" functions like f(), distance(), lbound(), rbound() etc. Everything relating to the actual pipeline-based custom functions is gone.

Obviously, a user-defined function facility would be a very desirable thing to have... though in terms of implementation, I think neither the ancient-CWB mechanism nor the Manatee mechanism is optimal from the point of view of user-friendliness! 

Hope this additional background info is useful.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of David Lukeš
Sent: 21 March 2018 12:29
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>; Ciarán Ó Duibhín <coduibhin at btinternet.com>
Subject: Re: [CWB] Suggestion: user intervention in constructing an index

 > won't the multiple columns still exist internally in some form within the
 > corpus? :-(

This sounds like you want something akin to dynamic attributes in Manatee (the
database backend of (No)SketchEngine, which was originally inspired by CWB).
These perform on-the-fly conversion based on regular attributes + a conversion
function (defined in a shared library, typically written in C). See e.g.:

<https://www.sketchengine.co.uk/dynamic-attributes/>
<https://www.sketchengine.co.uk/corpus-configuration-file-all-features/> 
(search for Dynamic attributes)

I don't know if CWB supports something similar, but even if it did, there's no
reason not to go the way Andrew suggested (additional regular attributes). Data
preparation and indexing is a step that is supposed to make your data searchable
in a fast and user-friendly way. If doing that involves precomputing some
attributes to make that easier, it's not only perfectly fine -- it's the right
way to go.
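
A minimal Python sketch of that kind of precomputation, assuming the input is
tab-separated, one-token-per-line vertical text with XML-style structure tags
on their own lines. The `fold` normalisation and the function names are
hypothetical; any derived value could be appended the same way and then
declared as an extra positional attribute when the corpus is encoded.

```python
import unicodedata


def fold(word: str) -> str:
    """Hypothetical normalisation: lower-case and strip diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def add_folded_column(lines):
    """Append a precomputed column to each token line of vertical text.

    Empty lines and lines starting with "<" (assumed to be structure tags)
    are passed through unchanged. A token that itself begins with "<" would
    be misclassified by this simple check.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            yield line
        else:
            word = line.split("\t")[0]  # first column is the word form
            yield f"{line}\t{fold(word)}"


if __name__ == "__main__":
    sample = ["<s>", "Ciarán\tNOUN", "Ó\tNOUN", "</s>"]
    for out in add_folded_column(sample):
        print(out)
```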

I assume the CWB index has ways of storing the data in a space-efficient way, if
that's what worries you, but in general, getting faster and more convenient
searching at the expense of storing more precomputed information about the data
is basically the definition of indexing as a concept, so it doesn't really make
sense to consider this as a drawback :) Unless of course you hit the physical
limitations of your hard drive, which however I don't suppose is a problem here?

 > I'm definitely interested in copying the BNCweb idea.

Another option would be to use empty structures to act as typographical "glue":

   sean
   <g/>
   bean

And then you'd write a frontend which interprets <g/> in the results as "don't
display any whitespace between these two tokens".
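
A minimal Python sketch of that rendering step, assuming the frontend receives
the result as a flat list of strings in which the glue marker appears as the
literal string "<g/>" (the function name and the input shape are hypothetical):

```python
GLUE = "<g/>"


def render(tokens):
    """Join tokens with single spaces, except across a glue marker."""
    pieces = []
    glue_pending = False
    for tok in tokens:
        if tok == GLUE:
            # Remember to suppress the space before the next real token.
            glue_pending = True
            continue
        if pieces and not glue_pending:
            pieces.append(" ")
        pieces.append(tok)
        glue_pending = False
    return "".join(pieces)


if __name__ == "__main__":
    print(render(["an", "sean", "<g/>", "bean", "sin"]))
```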

BUT (and these are two big buts):

1. As Andrew mentioned, unintuitive tokenization is potentially confusing for
users, *especially* if you also remove visual cues which would signal it in the
output. How are users supposed to know they should search for the tokens "sean+"
and "bhean" if all they ever see is "seanbhean"?

2. Conversely, if the only user of the corpus is you, I think writing a custom
frontend is much more trouble than it's worth. You can always postprocess the
results if you need to.

Best,

David

P.S. I don't know whether it's possible to compile and run Manatee under
Windows, but even if it is, I wouldn't count on it being hassle-free either.