[CWB] Suggestion: user intervention in constructing an index

Ciarán Ó Duibhín coduibhin at btinternet.com
Fri Mar 23 16:05:41 CET 2018


Thanks David.

Manatee is certainly of interest to me, though I wouldn't attempt to compile 
it for Windows.  I would consider using it on the web, if it can handle my 
twin issues: (1) generating the two attributes (word and "lemma") on the fly 
from one stored attribute (I will try to understand "dynamic attributes"); 
and (2) suppressing "non-original spaces" on concordance lines.  All the 
necessary information for both is unambiguously present in the mark-up.

Where there is no off-the-shelf support for suppressing non-original spaces 
on output, your suggestion of glue <g/> could be useful — I see that such an 
insertion comes through CWB and CQP unchanged.  But I'm unclear how CQP 
output could be post-processed while preserving interactivity.  It seems to 
come back to being able to intervene directly in the program, CQP in this 
case.
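
As a rough illustration of the post-processing step itself (it does not 
address the interactivity question), here is a minimal sketch in Python.  It 
assumes a concordance line has already been split into a list of tokens and 
that the glue marker arrives as a standalone "<g/>" item; the function name 
join_tokens and the sample tokens are hypothetical, not part of CWB or CQP.

```python
GLUE = "<g/>"  # assumed to appear as its own item between two tokens


def join_tokens(tokens):
    """Join tokens with single spaces, omitting the space wherever
    a glue marker sits between two tokens."""
    pieces = []
    glue_pending = False
    for tok in tokens:
        if tok == GLUE:
            # Remember that the next token must attach without a space.
            glue_pending = True
            continue
        if pieces and not glue_pending:
            pieces.append(" ")
        pieces.append(tok)
        glue_pending = False
    return "".join(pieces)


if __name__ == "__main__":
    sample = ["an", "sean", GLUE, "-", GLUE, "bhean", GLUE, "."]
    print(join_tokens(sample))
```

The glue markers themselves are dropped from the output, so the displayed 
line contains only the original text with the non-original spaces removed.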

Because my questions have been concerned with learning how to do specific 
operations, I'd rather not divert attention by discussing their 
advisability, but I think I must explain two things.

First, my objection to storing two attributes which can be derived from one 
is not based on practical grounds — the storage overheads in the one case 
and the processing overheads in the other would both be minimal — but I find 
it theoretically unsatisfactory.  A case for Occam's razor, in fact.

Second, concerns about confusing users by splitting a word into two index 
items are not well founded. A word like "seanbhean" is actually more 
frequently spelled with a hyphen between the parts, and sometimes even as 
two separate words.  My aim is to index these variations in a uniform way, 
while preserving the original spelling in each context.

As I mentioned, both these behaviours are already in use with my own 
software for Windows.  My object on this list has been to see if these 
behaviours can be reproduced in CWB, and thus to make the corpus usable on 
various computer platforms, including the web.  Perhaps (Andrew?) it would 
be easier to "share" BNCweb's treatment of "non-original spaces" (thanks for 
that useful term) if my corpus were to be set up on the web, rather than 
under Windows?

Regards,
Ciarán 


