[CWB] Suggestion: user intervention in constructing an index
Ciarán Ó Duibhín
coduibhin at btinternet.com
Fri Mar 23 16:05:41 CET 2018
Thanks, David.
Manatee is certainly of interest to me, but I wouldn't attempt to compile it
for Windows. But I would consider using it on the web, if it can handle my
twin issues of (1) generating the two attributes (word and "lemma")
on-the-fly from one stored attribute — I will try to understand "dynamic
attributes"; and (2) suppressing "non-original spaces" on concordance
lines. All the necessary info for both is unambiguously present in the
mark-up.
Where there is no off-the-shelf support for suppressing non-original spaces
on output, your suggestion of glue <g/> could be useful — I see that such an
insertion comes through CWB and CQP unchanged. But I'm unclear how CQP
output could be post-processed while preserving interactivity. It seems to
come back to being able to intervene directly in the program, CQP in this
case.
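
For illustration only, the post-processing step itself might look like the sketch below. It assumes a concordance line is available as a flat list of tokens, with a literal <g/> item wherever two tokens were not separated by a space in the original. The function name and the list format are hypothetical; nothing here is part of CQP, and it does not address the interactivity question.

```python
GLUE = "<g/>"  # assumed glue marker, as suggested earlier in this thread


def join_with_glue(tokens):
    """Join tokens with single spaces, except across a glue marker."""
    out = []
    glue_pending = False
    for tok in tokens:
        if tok == GLUE:
            # Remember that the next token attaches directly to the previous one.
            glue_pending = True
            continue
        if out and not glue_pending:
            out.append(" ")
        out.append(tok)
        glue_pending = False
    return "".join(out)


if __name__ == "__main__":
    print(join_with_glue(["an", "sean", GLUE, "bhean", "bhocht"]))
```

The glue items are dropped from the output and only decide whether a space is written, so the stored tokens stay split for indexing while the displayed line keeps the original spacing.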
Because my questions have been concerned with learning how to do specific
operations, I'd rather not divert attention by discussing their
advisability, but I think I must explain two things.
First, my objection to storing two attributes which can be derived from one
is not based on practical grounds — the storage overheads in the one case
and the processing overheads in the other would both be minimal — but I find
it theoretically unsatisfactory. A case for Occam's razor, in fact.
Second, concerns about confusing users by splitting a word into two index
items are not well founded. A word like "seanbhean" is actually more
frequently spelled with a hyphen between the parts, and sometimes even as
two separate words. My aim is to index these variations in a uniform way,
while preserving the original spelling in each context.
As I mentioned, both these behaviours are already in use with my own
software for Windows. My object on this list has been to see if these
behaviours can be reproduced in CWB, and thus to make the corpus usable on
various computer platforms, including the web. Perhaps (Andrew?) it would
be easier to "share" BNCweb's treatment of "non-original spaces" (thanks for
that useful term) if my corpus were to be set up on the web, rather than
under Windows?
Regards,
Ciarán