[CWB] Suggestion: user intervention in constructing an index
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon Mar 26 02:10:19 CEST 2018
Hi Ciarán,
David has said everything I would have said in re: the usefulness of redundant storage. Even if you did set up a dynamic attribute in Manatee so that you could keep your indexing to just one p-attribute, it would almost certainly result in slower querying (because each word type has to be run through the conversion function every time a query is run, to get the value of the dynamic attribute).
More generally, there is always a tradeoff between processing time and storage. CWB currently favours waste-processing-to-save-storage rather more than is optimal, as it strikes a balance based on the size of hard disks circa 1993 (which is something that CWB 4 and the Ziggurat engine will change). The approach you want to use would strike a balance even further towards the save-storage side. That is by way of explaining why it's not easy to do what you want in CWB: you would be working "against the grain" of how it is designed to work!
>>>
But I'm unclear how CQP
output could be post-processed while preserving interactivity. It seems to
come back to being able to intervene directly in the program, CQP in this
case.
<<<
Simple: you write a script that does the following loop:
- read input line from user standard input
- pass input line to CQP slave process (either directly, or via a library)
- if necessary, read output line(s) from CQP slave process
- modify output line(s) as per whatever requirements you have*
- print output line(s) to standard output
- print prompt for next user input.
The user then runs your script instead of running CQP.
(*) If you use one of the libraries, an easy way to do this is by specifying a "line handler" function when you call "exec/execute()" or "query()".
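As a rough illustration of the loop above, here is a minimal Python sketch. It is not tied to any particular CQP library: the backend is just a callable that takes a command and returns output lines, and fake_backend stands in for a real CQP slave process so that the example runs on its own. All names here (run_wrapper, remove_glue_spaces, fake_backend) are made up for illustration, as is the treatment of "exit". The line handler assumes that non-original spaces are marked with <g/>, as suggested earlier in the thread.

```python
import re
import sys

# Assumption: a non-original space is marked in the output as " <g/> ".
GLUE = re.compile(r"\s*<g/>\s*")


def remove_glue_spaces(line):
    """Line handler: drop each <g/> marker together with the spaces around it."""
    return GLUE.sub("", line)


def run_wrapper(backend, handler=remove_glue_spaces,
                stdin=sys.stdin, stdout=sys.stdout, prompt="> "):
    """Read commands, pass them to the backend, post-process and print the output."""
    stdout.write(prompt)
    stdout.flush()
    for raw in stdin:                      # read input line from the user
        command = raw.strip()
        if command in ("exit", "exit;"):   # hypothetical way to leave the loop
            break
        if command:
            for line in backend(command):  # pass to backend, read its output
                stdout.write(handler(line) + "\n")  # modify, then print
        stdout.write(prompt)               # prompt for the next input
        stdout.flush()


def fake_backend(command):
    """Stand-in for a CQP slave process; returns one invented concordance line."""
    return [f"12345: sean <g/> - <g/> bhean is the answer to [{command}]"]


if __name__ == "__main__":
    run_wrapper(fake_backend)
```

To use this for real, fake_backend would be replaced by a function that sends the command to CQP (directly or via one of the libraries) and returns the lines it prints; the rest of the loop stays the same.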
The process would be the same for an intervening web interface, except that instead of reading from standard input, you would read from the HTTP arguments in GET or POST, and then print HTML-formatted output to the browser. (That, in a nutshell, is what CQPweb does.)
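The web variant can be sketched in the same spirit. The function below is only an illustration of the idea, not how CQPweb itself is written: the HTTP parameter name "query", the function name handle_request and the choice of an HTML list for output are all assumptions. The backend is again any callable that takes a command and returns output lines.

```python
import html
from urllib.parse import parse_qs


def handle_request(query_string, backend, handler=lambda line: line):
    """Turn a GET query string into an HTML fragment of post-processed output."""
    params = parse_qs(query_string)
    # Assumption: the user's command arrives in a parameter called "query".
    command = params.get("query", [""])[0]
    lines = [handler(line) for line in backend(command)] if command else []
    # Escape each line so that corpus text cannot inject markup into the page.
    items = "".join(f"<li>{html.escape(line)}</li>" for line in lines)
    return f"<ul>{items}</ul>"
```

A web framework or CGI script would call something like this once per request and send the returned fragment to the browser inside a full page.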
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
Sent: 23 March 2018 15:06
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Suggestion: user intervention in constructing an index
Thanks David.
Manatee is certainly of interest to me, but I wouldn't attempt to compile it
for Windows. But I would consider using it on the web, if it can handle my
twin issues of (1) generating the two attributes (word and "lemma")
on-the-fly from one stored attribute — I will try to understand "dynamic
attributes"; and (2) of suppressing "non-original spaces" on concordance
lines. All the necessary info for both is unambiguously present in the
mark-up.
Where there is no off-the-shelf support for suppressing non-original spaces
on output, your suggestion of glue <g/> could be useful — I see that such an
insertion comes through CWB and CQP unchanged. But I'm unclear how CQP
output could be post-processed while preserving interactivity. It seems to
come back to being able to intervene directly in the program, CQP in this
case.
Because my questions have been concerned with learning how to do specific
operations, I'd rather not divert attention by discussing their
advisability, but I think I must explain two things.
First, my objection to storing two attributes which can be derived from one
is not based on practical grounds — the storage overheads in the one case
and the processing overheads in the other would both be minimal — but I find
it theoretically unsatisfactory. A case for Occam's razor, in fact.
Second, concerns about confusing users by splitting a word into two index
items are not well founded. A word like "seanbhean" is actually more
frequently spelled with a hyphen between the parts, and sometimes even as
two separate words. My aim is to index these variations in a uniform way,
while preserving the original spelling in each context.
As I mentioned, both these behaviours are already in use with my own
software for Windows. My object on this list has been to see if these
behaviours can be reproduced in CWB, and thus to make the corpus usable on
various computer platforms, including the web. Perhaps (Andrew?) it would
be easier to "share" BNCweb's treatment of "non-original spaces" (thanks for
that useful term) if my corpus were to be set up on the web, rather than
under Windows?
Regards,
Ciarán
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb