[CWB] Suggestion: user intervention in constructing an index

David Lukeš david.lukes at ff.cuni.cz
Wed Mar 21 13:29:19 CET 2018


 > won't the multiple columns still exist internally in some form within the
 > corpus? :-(

This sounds like you want something akin to dynamic attributes in Manatee
(the database backend of (No)SketchEngine, which was originally inspired by
CWB). These perform on-the-fly conversion based on regular attributes + a
conversion function (defined in a shared library, typically written in C).
See e.g.:

<https://www.sketchengine.co.uk/dynamic-attributes/>
<https://www.sketchengine.co.uk/corpus-configuration-file-all-features/> 
(search for Dynamic attributes)

I don't know if CWB supports something similar, but even if it did, there's
no reason not to go the way Andrew suggested (additional regular
attributes). Data preparation and indexing is a step that is supposed to
make your data searchable in a fast and user-friendly way. If doing that
involves precomputing some attributes to make that easier, it's not only
perfectly fine -- it's the right way to go.
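
To make the "additional regular attributes" route a bit more concrete, here
is a minimal sketch of precomputing one extra column before indexing. It
assumes the input is in the usual one-token-per-line, tab-separated vertical
format with XML tags on lines of their own. The helper name add_attribute
and the lowercasing rule are made up purely for illustration; plug in
whatever conversion you actually need.

```python
def add_attribute(lines, derive):
    """Yield vertical-format lines with one extra, precomputed column.

    `derive` is a hypothetical conversion function that maps the first
    column (the word form) to the value of the new attribute.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Pass blank lines and structural tags through unchanged.
        # Simplification: a literal token starting with "<" would also be
        # treated as a tag here, so escape such tokens beforehand.
        if not line or line.startswith("<"):
            yield line
            continue
        cols = line.split("\t")
        yield "\t".join(cols + [derive(cols[0])])


if __name__ == "__main__":
    sample = ["<s>", "Sean\tNOUN", "Bean\tNOUN", "</s>"]
    for out in add_attribute(sample, str.lower):
        print(out)
```

The new column then gets declared as one more positional attribute when you
encode the corpus, and queries can use it like any other attribute.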

I assume the CWB index stores its data in a space-efficient way, if that's
what worries you. But in general, getting faster and more convenient
searching at the expense of storing more precomputed information about the
data is basically the definition of indexing as a concept, so it doesn't
really make sense to consider this a drawback :) Unless of course you hit
the physical limits of your hard drive, which I don't suppose is a problem
here?

 > I'm definitely interested in copying the BNCweb idea.

Another option would be to use empty structures to act as typographical
"glue":

   sean
   <g/>
   bean

And then you'd write a frontend which interprets <g/> in the results as
"don't display any whitespace between these two tokens".
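
The display logic for that could be as simple as the sketch below. It
assumes the frontend receives each hit as a flat list of strings in which
the glue element shows up as the literal string "<g/>"; the function name
render is hypothetical, and how you actually get the tokens and structure
boundaries out of CWB is left open.

```python
GLUE = "<g/>"  # assumed marker for the empty "glue" structure


def render(tokens):
    """Join tokens with single spaces, except across a glue marker."""
    pieces = []
    glue_pending = False
    for tok in tokens:
        if tok == GLUE:
            # Remember to suppress the space before the next token.
            glue_pending = True
            continue
        if pieces and not glue_pending:
            pieces.append(" ")
        pieces.append(tok)
        glue_pending = False
    return "".join(pieces)


if __name__ == "__main__":
    print(render(["sean", "<g/>", "bean"]))  # prints "seanbean"
```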

BUT (and these are two big buts):

1. As Andrew mentioned, unintuitive tokenization is potentially confusing
for users, *especially* if you also remove visual cues which would signal it
in the output. How are users supposed to know they should search for the
tokens "sean+" and "bhean" if all they ever see is "seanbhean"?

2. Conversely, if the only user of the corpus is you, I think writing a
custom frontend is much more trouble than it's worth. You can always
postprocess the results if you need to.

Best,

David

P.S. I don't know whether it's possible to compile and run Manatee under
Windows, but even if it is, I wouldn't count on it being hassle-free either.

