[CWB] Suggestion: user intervention in constructing an index

Ciarán Ó Duibhín coduibhin at btinternet.com
Tue Mar 20 15:22:11 CET 2018


Thanks, Andrew, for those constructive ideas.


I have experimented with your second suggestion of adding a "lemma" column to the input.  (For info, what is marked up in my text is *partial* lemmatisation, covering changes at the beginning of words, so I'll call it "demut(ation)" rather than "lemma".  Full lemmatisation would require attention to terminal inflection as well.)


So, I could generate extra columns, like this:
            b^hean        bean           bhean
            ^mbean        bean           mbean
            Bean          bean           Bean
The first column is what is in the text; this column can be removed from the file when the other two have been generated from it.  The second is the index term ("demut").  The third is what I want to see in contexts ("word").
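
A minimal sketch in Python of how columns 2 and 3 could be derived from column 1. It assumes the markup rules described in this thread: "^" marks a mutation, the index term drops "^" plus the letter after it and is lower-cased, and the context form drops only "^". The function name `derive_columns` is hypothetical and is not part of CWB.

```python
import re


def derive_columns(marked: str) -> tuple[str, str]:
    """Return (demut, word) for one marked-up token from column 1."""
    # Index term: drop "^" together with the single letter after it,
    # then lower-case so that "Bean" joins the same family.
    demut = re.sub(r"\^.", "", marked).lower()
    # Context form: drop only the "^" marker and keep the original case.
    word = marked.replace("^", "")
    return demut, word


for token in ["b^hean", "^mbean", "Bean"]:
    demut, word = derive_columns(token)
    print(f"{token}\t{demut}\t{word}")
```

Run as written, this prints the three rows of the table above, tab-separated.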


While this will work, I am not comfortable with storing two columns to hold things which (unlike with normal lemmatisation) can be generated automatically from one column. If a user-supplied script could be called during the indexing process, it could act on the text shown in column 1 to produce what is shown in column 2.


Turning from the index keywords to the contexts, I am unsure how the extra-column approach will handle the case where a single token of text is to be split into two index items (column 2), which should be displayed in context without any space between them.
            sean+b^hean   sean           sean+
                          bean           bhean
Here I have used a + sign at the end of an item in column 3, to show that I wish to have no space inserted in the context before the following word. Is there already a way of doing this in CWB?  If not, access by a user-supplied script to the production of contexts could act on the text shown in column 1 to produce "seanbhean".
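
The split-and-rejoin idea can be sketched in Python as follows. This is only an illustration under the conventions proposed in this message: "+" inside a column 1 token separates index items, and a trailing "+" in column 3 means "no space before the next word". The names `split_token` and `join_context` are hypothetical, and `join_context` is a post-processing step applied to a list of context words outside CWB; it is not a claim about anything CWB itself provides.

```python
import re


def split_token(marked: str) -> list[tuple[str, str]]:
    """Split one marked-up token into (demut, word) rows, one per index item."""
    parts = marked.split("+")
    rows = []
    for i, part in enumerate(parts):
        demut = re.sub(r"\^.", "", part).lower()  # drop "^" and the next letter
        word = part.replace("^", "")              # drop only the "^" marker
        if i < len(parts) - 1:
            word += "+"                           # flag: no space before next item
        rows.append((demut, word))
    return rows


def join_context(words: list[str]) -> str:
    """Rebuild display text, omitting the space after any word ending in '+'."""
    pieces = []
    for w in words:
        if w.endswith("+"):
            pieces.append(w[:-1])
        else:
            pieces.append(w + " ")
    return "".join(pieces).rstrip()


rows = split_token("sean+b^hean")
print(rows)
print(join_context([word for _, word in rows]))
```

Here `split_token("sean+b^hean")` yields the two rows shown in the table above, and joining their column 3 values gives "seanbhean".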


Software of my own gives proof of concept of processing text marked up as in column 1 above, interpreting the markup during both the extraction of indexing terms and the production of contexts. I would still like the CWB developers to consider my request for a facility to execute a user-supplied script at these two points in the process.


Many thanks again for your advice,
Ciarán.
  ----- Original Message ----- 
  From: Hardie, Andrew 
  To: Open source development of the Corpus WorkBench 
  Sent: Friday, March 16, 2018 7:04 PM
  Subject: Re: [CWB] Suggestion: user intervention in constructing an index


  Hi Ciarán,

   

  There are two answers here… 

   

  First, it most certainly is already possible to adjust the form of the words as they are indexed. Simply prepare a script to make the change and pipe your files through it into the cwb-encode standard input (cwb-encode reads from standard input if no files are specified).

   

  (Or just run your converter separately on the data to create a modified version, and then index that, to avoid mucking about with pipes!)
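
A minimal sketch of such a converter in Python, assuming the input is already in cwb-encode's one-token-per-line format and uses the "^" markup described in the original message further down. It writes the display form and the demutated form as two tab-separated columns and passes blank lines and XML tag lines through unchanged. The names `convert_line` and `filter_stream` are hypothetical, and tokens containing "+" are not handled here.

```python
import re
import sys


def convert_line(line: str) -> str:
    """Turn one marked-up token line into 'word<TAB>demut'."""
    token = line.rstrip("\n")
    if not token or token.startswith("<"):
        return token  # blank lines and XML tags pass through untouched
    demut = re.sub(r"\^.", "", token).lower()  # index form, e.g. "bean"
    word = token.replace("^", "")              # display form, e.g. "bhean"
    return f"{word}\t{demut}"


def filter_stream(src, dst) -> None:
    """Copy src to dst line by line, converting each token line."""
    for line in src:
        dst.write(convert_line(line) + "\n")


# To use this as a pipe filter, end the script with:
#     filter_stream(sys.stdin, sys.stdout)
```

With that final call added and the script saved as, say, demut_filter.py, an invocation might look like `python3 demut_filter.py < tokens.txt | cwb-encode -d /path/to/data -R /path/to/registry/mycorpus -P demut`, where all file names, paths and the attribute name "demut" are placeholders to be replaced with your own.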

   

  Second, although that is the direct answer to your question, actually it is probably not “the right thing” to do. What you are talking about here is effectively lemmatisation – since bean/bhean/mbean are different forms of a single lemma, converting them all to “bean” means lemmatising. So what you’re talking about is indexing the lemma in place of the wordform. But the “right way” to do this in CWB is to add the lemma as a separate attribute – allowing the lemma to be queried, as well as / instead of the word.

   

  This means adding the lemma as a second column of the input file, like this:

   

  Bean    bean
  (…)
  ar      ar
  mbean   bean
  (…)
  mo      mo
  bhean   bean

   

  (and likewise for plural forms of bean, etc etc.)

   

  I don’t know what lemmatisation tool is considered standard for Gaelic at the moment, but I guess there must be options out there?  

   

  You can then do queries like this:

   

       [lemma="bean"];

   

  … to retrieve bean/mbean/bhean all at the same time.

   

  The advantage of encoding the lemma as a separate attribute is that the concordance can display the actual form that appears in the word-attribute, even if you have searched on the lemma-attribute. Whereas if you replace the word forms, you don’t get that.

   

  Hope this helps!

   

  best

   

  Andrew.

   

   

  From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ciarán Ó Duibhín
  Sent: 16 March 2018 18:18
  To: cwb at sslmit.unibo.it
  Subject: [CWB] Suggestion: user intervention in constructing an index

   

  I would like to suggest/request a facility in CWB (or its successor) where a user can intervene in the construction of an index.

   

  I envisage allowing the user to supply a script which can receive the token, extracted from the text and destined to be placed in an index, and can transform it.  The transformed token would be placed in the index, rather than the original form.

   

  The attached concordance output (tobar.jpg) — if attachments are allowed on the list — was made by another program, and shows an example of why I need this facility.

   

  In my example, under the keyword "bean" are indexed/concorded several different forms, including "bean" and "bhean" and "mbean" and "Bean", among others.  As far as I am aware, this cannot be achieved with CWB at present.

   

  In my texts, "bhean" is marked up as "b^hean", and "mbean" as "^mbean".  I would like to be able to supply a script which, in my case, would drop the character "^" and the letter immediately following it.

   

  In displayed contexts, I would need to be able to drop the character "^" but retain the letter following it.  This is what happens in the program which produced the screenshot.

   

  In my case again, I would also make my script lower-case the token, bringing "Bean" into the family.

   

  It would further be necessary to allow the script to return more than one keyword.  For example, the text might contain "seanbhean", which I encode as "sean+b^hean".  My script here would act on the character "+" and return TWO words for the index, "sean" and "bean".  Contexts would show "seanbhean", with "^" and "+" both deleted.

   

  For contexts, it might suffice (for my needs) to give CWB a list of characters to be dropped from contexts, without going to the lengths of allowing a user script for contexts, in addition to the script for keywords.

   

  With thanks,

  Ciarán Ó Duibhín.

   

   


