[CWB] Suggestion: user intervention in constructing an index

Wed Mar 28 02:59:29 CEST 2018

>>>
The need to post-process kwic output in order to remove non-original spaces may not arise at all if I am correct in thinking that David's suggestion, some time ago, of <g/> as glue is actually implemented in the Sketch Engine as meaning "leave no space between the preceding and following tokens".  In cwb, the " <g/> " comes through unchanged into the kwic context, and needs post-processing to remove it.  It might be an idea to have cqp implement <g/> like that too.
<<<

No, this is not implemented in Manatee (and definitely will never be in CQP). David was pointing out that you could use such an s-attribute to indicate the presence/absence of orthographic space in the index – but to actually change it into spaces where necessary in the concordance would still be the job of the wrapper script. Basically this is an alternative to doing the same job via a binary p-attribute. Either way, translation of format is required outside the query program itself.

The underlying engine is appropriately neutral about the semantics of any attribute name, so designating a specific s-attribute as meaning “glue” is not something that would ever be built in at the system level of CQP. Front ends can of course impose whatever requirements about attribute semantics they like.
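
As a rough illustration of the wrapper-script side of this, here is a minimal Python sketch that drops the glue marker and the spaces around it from a kwic line. The function name apply_glue and the exact form of the marker are assumptions made for the example; neither CQP nor Manatee provides this.

```python
import re

# Assumed glue marker, as it would appear in the kwic output.
GLUE = "<g/>"


def apply_glue(kwic_line: str, glue: str = GLUE) -> str:
    """Remove each glue marker together with the whitespace around it."""
    pattern = r"\s*" + re.escape(glue) + r"\s*"
    return re.sub(pattern, "", kwic_line)


if __name__ == "__main__":
    # The tokens on either side of the marker are joined without a space.
    print(apply_glue("sean <g/> bhean"))
```

The same function could be applied equally well if the glue information came from a binary p-attribute instead; only the pattern to look for would change.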

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
Sent: 27 March 2018 17:32
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Suggestion: user intervention in constructing an index

Thank you, Vlado.  That's a really neat feature of NoSketch but for me it is better still to break "seanbhean" and "sean-bhean" each into two tokens in the vertical file:
 (1) word="sean" or "sean-" respectively; demut="sean"
 (2) word="bhean"; demut="bean"
The query is then made on the "demut" p-attribute ("demut" is like how people use "lemma", but linguistically this is not a lemma). This results in:
• a search for "sean" and "bean" together will retrieve all of "seanbhean", "sean-bhean" and "sean bhean" (NoSketch "sean--bhean" can do that too)
• a search for "bean" will retrieve all of those, as well as all the other examples of "bean"; and correspondingly for a search for "sean".
That is what will best suit the lexicographical user of the corpus.
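
To make the two-token layout concrete, here is a small Python sketch under simplified assumptions. It writes (word, demut) pairs as tab-separated vertical lines and imitates a search on the "demut" column; the token list and the helper names to_vertical and find_sequence are invented for the illustration. In CQP itself the corresponding query would be something like [demut="sean"] [demut="bean"].

```python
# Each token is a (word, demut) pair; the values follow the example above.
TOKENS = [
    ("sean", "sean"), ("bhean", "bean"),    # "seanbhean" split in two
    ("sean-", "sean"), ("bhean", "bean"),   # "sean-bhean"
    ("sean", "sean"), ("bhean", "bean"),    # "sean bhean"
    ("bean", "bean"),                       # an unrelated example of "bean"
]


def to_vertical(tokens):
    """One token per line, with the attributes separated by tabs."""
    return "\n".join(f"{word}\t{demut}" for word, demut in tokens)


def find_sequence(tokens, demuts):
    """Start positions where consecutive tokens carry the given demut values."""
    wanted = list(demuts)
    n = len(wanted)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if [demut for _, demut in tokens[i:i + n]] == wanted
    ]


if __name__ == "__main__":
    print(to_vertical(TOKENS))
    print(find_sequence(TOKENS, ["sean", "bean"]))
    print(find_sequence(TOKENS, ["bean"]))
```

The search for "sean" followed by "bean" finds all three spellings, and the search for "bean" alone finds those plus the remaining example.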

Thank you, Andrew, for showing how to display p-attributes in the kwic line; and for clarifying that CWB/Perl has not been made to work under Windows.  I have only a couple of comments.

>> It will avoid having a permanent multi-column file outside the corpus, but won't the multiple columns still exist internally in some form within the corpus?  :-(
> Yes, but it has to. If you want to store more than one item of separately-searchable information about each token – in this case, your word/demut combination – then you have to have multiple attributes.

OK, I see that cwb's architecture requires that.

> If you want to avoid at all costs multiple attributes being stored under the hood then… you don’t want to use CWB! (Or Manatee, since that works on precisely the same principle.)

Yes,  I don't need to search on "word", and the program I use in Windows stores only "demut" in the index, and fetches the kwic contexts from a copy of the running text.  (Incidentally, this means that "non-original spaces" never enter the contexts.)  Given that cwb doesn't need a copy of the running text, the storage requirements of the two methods should be similar.
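
For comparison, the index-plus-running-text approach can be sketched in Python as below. This is only an illustration of the principle, not the actual Windows program: the spans in ENTRIES are written by hand, the step that splits and demutates the tokens is not shown, and the names build_index and kwic are made up for the example.

```python
from collections import defaultdict

TEXT = "seanbhean agus sean-bhean"

# Hand-made (start, end, demut) spans into TEXT; producing them is not shown.
ENTRIES = [
    (0, 4, "sean"), (4, 9, "bean"),
    (10, 14, "agus"),
    (15, 20, "sean"), (20, 25, "bean"),
]


def build_index(entries):
    """Map each demut value to the character spans of its tokens."""
    index = defaultdict(list)
    for start, end, demut in entries:
        index[demut].append((start, end))
    return index


def kwic(text, index, key, width=5):
    """Cut each context straight out of the running text."""
    return [
        text[max(0, start - width):end + width]
        for start, end in index.get(key, [])
    ]


if __name__ == "__main__":
    for context in kwic(TEXT, build_index(ENTRIES), "bean"):
        print(context)
```

Because the contexts are slices of the running text, they contain only the spaces that were there originally.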

A script to post-process output from cqp before displaying it:
>
>- read input line from user standard input
>- pass input line to CQP slave process (either directly, or via a library)
>- if necessary, read output line(s) from CQP slave process
>- modify output line(s) as per whatever requirements you have*
>- print output line(s) to standard output
>- print prompt for next user input.
>
>The user then runs your script instead of running CQP.
>
>(*) If you use one of the libraries, an easy way to do this is by specifying a "line handler" function when you call "exec/execute()" or "query()".
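
The series of steps quoted above could be sketched in Python roughly as follows. How the script actually talks to the CQP slave process is deliberately left abstract here: backend is a hypothetical callable that takes one command and returns the output lines, and line_handler is whatever per-line modification is wanted. Neither name comes from the CWB libraries.

```python
import io
import sys
from typing import Callable, Iterable, List, TextIO


def run_wrapper(
    commands: Iterable[str],
    backend: Callable[[str], List[str]],
    line_handler: Callable[[str], str],
    out: TextIO = sys.stdout,
    prompt: str = "> ",
) -> None:
    """Read commands, pass them on, and print the modified output lines."""
    out.write(prompt)
    for command in commands:                      # read input line from user
        command = command.strip()
        if command:
            for line in backend(command):         # pass to CQP, read output
                out.write(line_handler(line) + "\n")  # modify, then print
        out.write(prompt)                         # prompt for next input


if __name__ == "__main__":
    # Stand-in backend that returns a fixed kwic line for any command.
    def fake_backend(command: str) -> List[str]:
        return ["an sean <g/> bhean sin"]

    def remove_glue(line: str) -> str:
        return line.replace(" <g/> ", "")

    buffer = io.StringIO()
    run_wrapper(["some query;\n"], fake_backend, remove_glue, out=buffer)
    print(buffer.getvalue())
```

In real use, commands would be sys.stdin and backend would wrap a CQP slave process or one of the libraries mentioned above.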

Many thanks for that explanation.  I should also look at existing front-ends to cqp, which must do something like this series of steps, and may allow the user some control over the fourth step.  The most promising of these may be TXM.

The need to post-process kwic output in order to remove non-original spaces may not arise at all if I am correct in thinking that David's suggestion, some time ago, of <g/> as glue is actually implemented in the Sketch Engine as meaning "leave no space between the preceding and following tokens".  In cwb, the " <g/> " comes through unchanged into the kwic context, and needs post-processing to remove it.  It might be an idea to have cqp implement <g/> like that too.

Regards,
Ciarán.
