[CWB] Suggestion: user intervention in constructing an index

Ciarán Ó Duibhín coduibhin at btinternet.com
Fri Mar 16 19:17:52 CET 2018


I would like to suggest/request a facility in CWB (or its successor) where a user can intervene in the construction of an index.

I envisage allowing the user to supply a script which receives each token as it is extracted from the text, before it is placed in an index, and can transform it.  The transformed token, rather than the original form, would then be placed in the index.

The attached concordance output (tobar.jpg) - if attachments are allowed on the list - was made by another program, and shows an example of why I need this facility.

In my example, under the keyword "bean" are indexed/concorded several different forms, including "bean" and "bhean" and "mbean" and "Bean", among others.  As far as I am aware, this cannot be achieved with CWB at present.

In my texts, "bhean" is marked up as "b^hean", and "mbean" as "^mbean".  I would like to be able to supply a script which, in my case, would drop the character "^" and the letter immediately following it.

In displayed contexts, I would need to be able to drop the character "^" but retain the letter following it.  This is what happens in the program which produced the screenshot.

In my case again, I would also make my script lower-case the token, bringing "Bean" into the family.
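As an illustration only, the keyword transformation described above might look like the following Python sketch.  The function name index_form is hypothetical, and nothing here is an existing CWB interface; it only shows the rule "drop ^ and the letter after it, then lower-case".

```python
import re


def index_form(token: str) -> str:
    """Return the form under which a marked-up token would be indexed.

    Hypothetical helper: not part of CWB.
    """
    # Remove "^" together with the single character that follows it,
    # so "b^hean" and "^mbean" both reduce to "bean".
    stripped = re.sub(r"\^.", "", token)
    # Lower-case the result so that "Bean" joins the same family.
    return stripped.lower()


if __name__ == "__main__":
    for tok in ["bean", "b^hean", "^mbean", "Bean"]:
        print(tok, "->", index_form(tok))
```

With this rule, all four forms mentioned above ("bean", "b^hean", "^mbean", "Bean") end up under the single keyword "bean".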

It would further be necessary to allow the script to return more than one keyword.  For example, the text might contain "seanbhean", which I encode as "sean+b^hean".  My script here would act on the character "+" and return TWO words for the index, "sean" and "bean".  Contexts would show "seanbhean", with "^" and "+" both deleted.
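A sketch of a script returning more than one keyword, again in Python and again purely hypothetical (index_forms is an invented name, not a CWB function), could split on "+" first and then apply the same "^" rule and lower-casing to each part:

```python
import re


def index_forms(token: str) -> list[str]:
    """Return every keyword a marked-up token should be indexed under.

    Hypothetical helper: not part of CWB.
    """
    keywords = []
    # "+" marks a compound boundary; each part is indexed separately.
    for part in token.split("+"):
        # Drop "^" and the character after it, then lower-case.
        form = re.sub(r"\^.", "", part).lower()
        if form:  # skip empty parts, e.g. from a stray "+"
            keywords.append(form)
    return keywords


if __name__ == "__main__":
    print(index_forms("sean+b^hean"))
```

For "sean+b^hean" this yields the two keywords "sean" and "bean"; a simple token such as "Bean" yields the single keyword "bean".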

For contexts, it might suffice (for my needs) to give CWB a list of characters to be dropped from contexts, without going to the lengths of allowing a user script for contexts, in addition to the script for keywords.
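The simpler "list of characters to drop" idea for contexts can be sketched as below.  This is an assumption about how such an option might behave, not a description of any existing CWB setting; display_form and drop_chars are hypothetical names.

```python
def display_form(token: str, drop_chars: str = "^+") -> str:
    """Return the token as it would appear in a displayed context.

    Hypothetical helper: not part of CWB.  Every character listed in
    drop_chars is deleted; all other characters are kept unchanged.
    """
    # str.translate deletes any character mapped to None.
    table = {ord(ch): None for ch in drop_chars}
    return token.translate(table)


if __name__ == "__main__":
    for tok in ["b^hean", "^mbean", "sean+b^hean"]:
        print(tok, "->", display_form(tok))
```

Here only the markup characters themselves are removed, so "b^hean" is shown as "bhean" and "sean+b^hean" as "seanbhean", while case is left as in the text.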

With thanks,
Ciarán Ó Duibhín.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: tobar.jpg
Type: application/octet-stream
Size: 110261 bytes
Desc: not available
URL: <http://liste.sslmit.unibo.it/pipermail/cwb/attachments/20180316/a355cbe7/attachment-0001.obj>

