[CWB] Suggestion: user intervention in constructing an index

Fri Mar 16 20:04:03 CET 2018

Hi Ciarán,

There are two answers here…

First, it most certainly is already possible to adjust the form of the words as they are indexed. Simply prepare a script to make the change and pipe your files through it into the cwb-encode standard input (cwb-encode reads from standard input if no files are specified).

(Or just run your converter separately on the data to create a modified version, and then index that, to avoid mucking about with pipes!)
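As a rough illustration, such a script might look like the Python sketch below. It assumes the input is already in one-token-per-line form with XML tags on lines of their own, and that the markup is as described in the original message ("b^hean", "^mbean"). The names normalise_token and convert are made up for this example and are not part of CWB.

```python
import re

# Assumption: "^" marks an initial mutation, and the indexed form is obtained
# by dropping "^" together with the single character that follows it.
MUTATION = re.compile(r"\^.")


def normalise_token(line):
    """Return the form to be indexed for one line of one-token-per-line text."""
    token = line.rstrip("\n")
    if token.startswith("<"):
        # Structural tags such as <s> are passed through unchanged.
        return token
    return MUTATION.sub("", token).lower()


def convert(lines):
    """Yield the converted lines, each terminated by a newline."""
    for line in lines:
        yield normalise_token(line) + "\n"


# Small demonstration on in-memory data.
sample = ["<s>\n", "Bean\n", "^mbean\n", "b^hean\n", "</s>\n"]
print("".join(convert(sample)), end="")
```

To use it as a real filter, replace the demonstration with sys.stdout.writelines(convert(sys.stdin)) (plus import sys) and pipe the script's output into cwb-encode with whatever options you normally use.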

Second, although that is the direct answer to your question, it is probably not “the right thing” to do. What you are describing is effectively lemmatisation: bean/bhean/mbean are different forms of a single lemma, so converting them all to “bean” amounts to indexing the lemma in place of the wordform. The “right way” to do this in CWB is to add the lemma as a separate attribute, so that the lemma can be queried as well as, or instead of, the word.

This means adding the lemma as a second, tab-separated column of the input file, like this:

Bean    bean
(…)
ar      ar
mbean   bean
(…)
mo      mo
bhean   bean

(and likewise for plural forms of bean, etc etc.)
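For the marked-up forms in the original message, the two columns could be produced along the lines of the Python sketch below. This is only a stand-in for a real lemmatiser: it assumes that "^" plus the following character marks a mutation, it handles nothing but that markup and capitalisation, and it does not deal with plurals or with "+" compounds. The function names are hypothetical.

```python
import re


def surface(token):
    """Word form for display: remove the "^" marker but keep every letter."""
    return token.replace("^", "")


def naive_lemma(token):
    """Crude lemma: drop "^" plus the character after it, then lower-case."""
    return re.sub(r"\^.", "", token).lower()


def to_row(token):
    """One line of input for the encoder: word, TAB, lemma."""
    return surface(token) + "\t" + naive_lemma(token)


for tok in ["Bean", "ar", "^mbean", "mo", "b^hean"]:
    print(to_row(tok))
```

The word column keeps the form as it should appear in the concordance, while the lemma column carries the normalised form used for searching.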

I don’t know what lemmatisation tool is considered standard for Gaelic at the moment, but I guess there must be options out there?

You can then do queries like this:

     [lemma="bean"];

… to retrieve bean/mbean/bhean all at the same time.

The advantage of encoding the lemma as a separate attribute is that the concordance can display the actual form that appears in the word attribute, even if you have searched on the lemma attribute. If you replace the word forms instead, the original forms never reach the index and so cannot be displayed.
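The difference can be seen in a toy model. The Python sketch below is not how CQP works internally. It simply keeps (word, lemma) pairs, matches on the lemma, and prints the word forms. The data and the function name kwic are invented for this illustration.

```python
# Toy corpus: one (word, lemma) pair per token position.
rows = [
    ("Bean", "bean"),
    ("ar", "ar"),
    ("mbean", "bean"),
    ("mo", "mo"),
    ("bhean", "bean"),
]


def kwic(rows, lemma, window=1):
    """Match on the lemma column, but build the context from the word column."""
    words = [word for word, _ in rows]
    hits = []
    for i, (_, lem) in enumerate(rows):
        if lem == lemma:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append(" ".join(left + ["[" + words[i] + "]"] + right))
    return hits


for line in kwic(rows, "bean"):
    print(line)
```

Each hit shows the form as written (Bean, mbean, bhean), which would not be possible if only the normalised form had been stored.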

Hope this helps!

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
Sent: 16 March 2018 18:18
To: cwb at sslmit.unibo.it
Subject: [CWB] Suggestion: user intervention in constructing an index

I would like to suggest/request a facility in CWB (or its successor) where a user can intervene in the construction of an index.

I envisage allowing the user to supply a script which can receive the token, extracted from the text and destined to be placed in an index, and can transform it.  The transformed token would be placed in the index, rather than the original form.

The attached concordance output (tobar.jpg) — if attachments are allowed on the list — was made by another program, and shows an example of why I need this facility.

In my example, under the keyword "bean" are indexed/concorded several different forms, including "bean" and "bhean" and "mbean" and "Bean", among others.  As far as I am aware, this cannot be achieved with CWB at present.

In my texts, "bhean" is marked up as "b^hean", and "mbean" as "^mbean".  I would like to be able to supply a script which, in my case, would drop the character "^" and the letter immediately following it.

In displayed contexts, I would need to be able to drop the character "^" but retain the letter following it.  This is what happens in the program which produced the screenshot.

In my case again, I would also make my script lower-case the token, bringing "Bean" into the family.

It would further be necessary to allow the script to return more than one keyword.  For example, the text might contain "seanbhean", which I encode as "sean+b^hean".  My script here would act on the character "+" and return TWO words for the index, "sean" and "bean".  Contexts would show "seanbhean", with "^" and "+" both deleted.

For contexts, it might suffice (for my needs) to give CWB a list of characters to be dropped from contexts, without going to the lengths of allowing a user script for contexts, in addition to the script for keywords.

With thanks,
Ciarán Ó Duibhín.
