[CWB] Suggestion: user intervention in constructing an index

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Mar 26 02:10:19 CEST 2018


Hi Ciarán,

David has said everything I would have said about the usefulness of redundant storage. Even if you did set up a dynamic attribute in Manatee so that you could keep your indexing to just one p-attribute, it would almost certainly result in slower querying, because each word type has to be run through the conversion function every time a query is run in order to get the value of the dynamic attribute. More generally, there is always a tradeoff between processing time and storage. CWB currently favours waste-processing-to-save-storage rather more than is optimal, because it strikes a balance based on the size of hard disks circa 1993; that is something CWB 4 and the Ziggurat engine will change. The approach you want to use would push the balance even further towards the save-storage side, which is by way of explaining why it is not easy to do what you want in CWB: you would be working "against the grain" of how it is designed to work!

>>>
But I'm unclear how CQP 
output could be post-processed while preserving interactivity.  It seems to 
come back to being able to intervene directly in the program, CQP in this 
case.
<<<

Simple: you write a script that does the following loop:

- read input line from user standard input
- pass input line to CQP slave process (either directly, or via a library)
- if necessary, read output line(s) from CQP slave process
- modify output line(s) as per whatever requirements you have*
- print output line(s) to standard output
- print prompt for next user input.

The user then runs your script instead of running CQP.

(*) If you use one of the libraries, an easy way to do this is by specifying a "line handler" function when you call "exec/execute()" or "query()".
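
By way of illustration, the loop above might be sketched in Python roughly as follows. This is only a sketch under assumptions, not code from CWB or any of its libraries: it assumes that "cqp -c" starts CQP as a child process, and that sending ".EOL.;" makes CQP print a line reading "-::-EOL-::-" that can be used to detect the end of the output (check both against your CWB version). The function names and the join_glue() handler, which deletes a literal <g/> together with the spaces around it, are invented for the example.

```python
import re
import subprocess
import sys

# Assumptions about CQP child mode (check against your CWB version):
EOL_COMMAND = ".EOL.;"      # command asking CQP to print an end-of-output marker
EOL_MARKER = "-::-EOL-::-"  # the marker line expected in response


def join_glue(line):
    """Hypothetical line handler: delete <g/> and the spaces on either side."""
    return re.sub(r"\s*<g/>\s*", "", line)


def ask(proc, command):
    """Send one command to the child; return its output lines up to the marker."""
    command = command.strip()
    if command:
        proc.stdin.write(command + "\n")
    proc.stdin.write(EOL_COMMAND + "\n")
    proc.stdin.flush()
    lines = []
    while True:
        raw = proc.stdout.readline()
        if not raw or raw.rstrip("\n") == EOL_MARKER:
            return lines
        lines.append(raw.rstrip("\n"))


def run(proc, handler=join_glue, user_in=sys.stdin, user_out=sys.stdout):
    """Prompt, forward each input line to the child, print the modified output."""
    ask(proc, "")  # discard anything the child printed at start-up
    while True:
        user_out.write("> ")
        user_out.flush()
        line = user_in.readline()
        if not line or line.strip() in ("exit", "exit;"):
            break
        for out in ask(proc, line):
            user_out.write(handler(out) + "\n")


def main():
    """Start CQP as a child process (assumed invocation) and run the loop."""
    proc = subprocess.Popen(
        ["cqp", "-c"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,
    )
    try:
        run(proc)
    finally:
        proc.stdin.close()
        proc.wait()
```

To try it, add a call to main() at the end of the script and run the script in place of CQP. Any other post-processing can be substituted by passing a different handler to run().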

The process would be the same for an intervening web interface, except that instead of reading from standard input, you would read from the HTTP arguments in GET or POST, and then print HTML-formatted output to the browser. (That, in a nutshell, is what CQPweb does.)
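
For the web variant, the two ends of the loop might look something like the following Python sketch. It only illustrates the idea and is not how CQPweb is actually implemented: the parameter name "q" and both function names are invented for the example, and the step in between (passing the query to CQP and collecting the output lines) is left out.

```python
from html import escape
from urllib.parse import parse_qs


def query_from_request(query_string, field="q"):
    """Read the user's query from a GET query string (hypothetical field name)."""
    values = parse_qs(query_string).get(field, [])
    return values[0].strip() if values else ""


def render_html(query, lines):
    """Format already post-processed output lines as an HTML fragment."""
    items = "\n".join(f"<li>{escape(line)}</li>" for line in lines)
    return f"<h1>Results for {escape(query)}</h1>\n<ol>\n{items}\n</ol>"
```

Escaping each line before printing it matters here, since concordance output may contain characters such as "<" that the browser would otherwise interpret as markup.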

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
Sent: 23 March 2018 15:06
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Suggestion: user intervention in constructing an index

Thanks David.

Manatee is certainly of interest to me, but I wouldn't attempt to compile it 
for Windows.  But I would consider using it on the web, if it can handle my 
twin issues of (1) generating the two attributes (word and "lemma") 
on-the-fly from one stored attribute — I will try to understand "dynamic 
attributes"; and (2) of suppressing "non-original spaces" on concordance 
lines.  All the necessary info for both is unambiguously present in the 
mark-up.

Where there is no off-the-shelf support for suppressing non-original spaces 
on output, your suggestion of glue <g/> could be useful — I see that such an 
insertion comes through CWB and CQP unchanged.  But I'm unclear how CQP 
output could be post-processed while preserving interactivity.  It seems to 
come back to being able to intervene directly in the program, CQP in this 
case.

Because my questions have been concerned with learning how to do specific 
operations, I'd rather not divert attention by discussing their 
advisability, but I think I must explain two things.

First, my objection to storing two attributes which can be derived from one 
is not based on practical grounds — the storage overheads in the one case 
and the processing overheads in the other would both be minimal — but I find 
it theoretically unsatisfactory.  A case for Occam's razor, in fact.

Second, concerns about confusing users by splitting a word into two index 
items are not well founded. A word like "seanbhean" is actually more 
frequently spelled with a hyphen between the parts, and sometimes even as 
two separate words.  My aim is to index these variations in a uniform way, 
while preserving the original spelling in each context.

As I mentioned, both these behaviours are already in use with my own 
software for Windows.  My object on this list has been to see if these 
behaviours can be reproduced in CWB, and thus to make the corpus usable on 
various computer platforms, including the web.  Perhaps (Andrew?) it would 
be easier to "share" BNCweb's treatment of "non-original spaces" (thanks for 
that useful term) if my corpus were to be set up on the web, rather than 
under Windows?

Regards,
Ciarán 
