<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML xmlns="http://www.w3.org/TR/REC-html40" xmlns:v = 

"urn:schemas-microsoft-com:vml" xmlns:o = 

"urn:schemas-microsoft-com:office:office" xmlns:w = 

"urn:schemas-microsoft-com:office:word" xmlns:m = 

"http://schemas.microsoft.com/office/2004/12/omml"><HEAD>

<META content="text/html; charset=utf-8" http-equiv=Content-Type>

<META name=GENERATOR content="MSHTML 9.00.8112.16872">

<STYLE>@font-face {

        font-family: Cambria Math;

}

@font-face {

        font-family: Calibri;

}

@font-face {

        font-family: Verdana;

}

@page WordSection1 {size: 612.0pt 792.0pt; margin: 72.0pt 72.0pt 72.0pt 72.0pt; }

P.MsoNormal {

        MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Times New Roman",serif; FONT-SIZE: 12pt

}

LI.MsoNormal {

        MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Times New Roman",serif; FONT-SIZE: 12pt

}

DIV.MsoNormal {

        MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Times New Roman",serif; FONT-SIZE: 12pt

}

A:link {

        COLOR: blue; TEXT-DECORATION: underline; mso-style-priority: 99

}

SPAN.MsoHyperlink {

        COLOR: blue; TEXT-DECORATION: underline; mso-style-priority: 99

}

A:visited {

        COLOR: purple; TEXT-DECORATION: underline; mso-style-priority: 99

}

SPAN.MsoHyperlinkFollowed {

        COLOR: purple; TEXT-DECORATION: underline; mso-style-priority: 99

}

P.msonormal0 {

        FONT-FAMILY: "Times New Roman",serif; MARGIN-LEFT: 0cm; FONT-SIZE: 12pt; MARGIN-RIGHT: 0cm; mso-style-name: msonormal; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto

}

LI.msonormal0 {

        FONT-FAMILY: "Times New Roman",serif; MARGIN-LEFT: 0cm; FONT-SIZE: 12pt; MARGIN-RIGHT: 0cm; mso-style-name: msonormal; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto

}

DIV.msonormal0 {

        FONT-FAMILY: "Times New Roman",serif; MARGIN-LEFT: 0cm; FONT-SIZE: 12pt; MARGIN-RIGHT: 0cm; mso-style-name: msonormal; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto

}

SPAN.EmailStyle18 {

        FONT-STYLE: normal; FONT-FAMILY: "Verdana",sans-serif; COLOR: #1f497d; FONT-WEIGHT: normal; TEXT-DECORATION: none; mso-style-type: personal-reply

}

.MsoChpDefault {

        FONT-SIZE: 10pt; mso-style-type: export-only

}

DIV.WordSection1 {

        page: WordSection1

}

</STYLE>

<!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]--></HEAD>

<BODY lang=EN-GB bgColor=white vLink=purple link=blue>

<DIV><FONT size=2 face="Courier New">Thanks, Andrew, for those constructive 

ideas.</FONT></DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New">I have experimented with your second 

suggestion of adding a "lemma" column to the input.&nbsp; (For info,&nbsp;what 

is marked up&nbsp;in my text is *partial* lemmatisation, covering changes at the 

beginning of words,&nbsp;so I'll call it "demut(ation)" rather than 

"lemma".&nbsp;&nbsp;Full lemmatisation&nbsp;would require attention to terminal 

inflection as well.)</FONT></DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New">So, I could generate extra columns, like 

this:</FONT></DIV>

<DIV><FONT size=2 

face="Courier New">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

b^hean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

bhean</FONT></DIV>

<DIV><FONT size=2 

face="Courier New">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

^mbean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

mbean</FONT></DIV>

<DIV><FONT size=2 

face="Courier New">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

Bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

Bean</FONT></DIV>

<DIV><FONT size=2 face="Courier New">The first column is what is in the 

text;&nbsp;this column&nbsp;can be removed from the&nbsp;file&nbsp;when the 

other two have been generated from it.&nbsp; The second is the index term 

("demut").&nbsp; The third is what I want to see in contexts 

("word").</FONT></DIV>
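<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>
<DIV><FONT size=2 face="Courier New">A minimal sketch of how columns 2 and 3 
could be derived from column 1, assuming Python and a single token with no "+" 
in it.&nbsp; The function name demutate is only illustrative; the rules are the 
ones described here: "demut" drops "^" together with the letter after it and is 
lower-cased, while "word" drops only the "^".</FONT></DIV>
<PRE>
```python
import re

def demutate(token):
    """Return (demut, word) for one marked-up token such as 'b^hean'."""
    # Index term: drop "^" together with the character right after it,
    # then lower-case so that "Bean" joins the same family.
    demut = re.sub(r"\^.", "", token).lower()
    # Display form: drop only the "^" marker; letters and case are kept.
    word = token.replace("^", "")
    return demut, word

# Print the three columns (text, demut, word) for the example tokens.
for token in ["b^hean", "^mbean", "Bean"]:
    demut, word = demutate(token)
    print(token, demut, word, sep="\t")
```
</PRE>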

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New">While this will work, <FONT size=2>I am not 
comfortable with storing two columns for things which 
(unlike&nbsp;with normal lemmatisation)&nbsp;can be generated automatically from 
one column.&nbsp; If a&nbsp;user-supplied script could be called during the 
indexing process, it could act on the text shown in column 1&nbsp;to produce 
what is shown in column 2.</FONT></FONT></DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT face="Courier New"><FONT size=2>Turning from the index keywords to 

the contexts,</FONT></FONT><FONT face="Courier New"><FONT size=2> I am 

unsure&nbsp;how the extra-column approach will handle the case where a single 

token of text is to be split into two index items (column 2), which should be 

displayed in context without any space between them.</FONT></FONT></DIV>

<DIV><FONT size=2 

face="Courier New">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

sean+b^hean&nbsp;&nbsp; 

sean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

sean+</FONT></DIV>

<DIV><FONT size=2 

face="Courier New">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

&nbsp;bhean</FONT></DIV>

<DIV><FONT size=2 face="Courier New">Here I have used a + sign at the end of an 
item in column 3 to show that no space should be inserted in the context 
before the following word.&nbsp; Is there already a way of doing&nbsp;this&nbsp;in 
CWB?&nbsp; If not, a&nbsp;user-supplied script called during the production 
of contexts could act on the text shown in column 1 to 
produce&nbsp;"seanbhean".</FONT></DIV>
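<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>
<DIV><FONT size=2 face="Courier New">The joining rule can be sketched as 
follows, assuming Python.&nbsp; The function name join_context and the sample 
token list are only illustrative, and the trailing "+" is my own marker, not a 
CWB feature.</FONT></DIV>
<PRE>
```python
def join_context(words):
    """Join column-3 items, omitting the space after any item ending in '+'."""
    pieces = []
    for word in words:
        if word.endswith("+"):
            # Trailing "+" is only a marker: strip it and add no space after it.
            pieces.append(word[:-1])
        else:
            pieces.append(word + " ")
    # Remove the space left after the final item.
    return "".join(pieces).rstrip()

print(join_context(["mo", "sean+", "bhean"]))  # mo seanbhean
```
</PRE>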

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New">My own software serves as a proof of 
concept&nbsp;for&nbsp;processing text marked up as&nbsp;in column 1 above: it 
interprets the markup during both the extraction of index 
terms and the production of contexts.&nbsp; I would therefore still like the CWB 
developers to consider my request for a facility to execute&nbsp;a 
user-supplied script at these two points in the process.</FONT></DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New"></FONT>&nbsp;</DIV>

<DIV><FONT size=2 face="Courier New">Many thanks again for your 

advice,</FONT></DIV>

<DIV><FONT size=2 face="Courier New">Ciarán.</FONT></DIV>

<BLOCKQUOTE 

style="BORDER-LEFT: #000000 2px solid; PADDING-LEFT: 5px; PADDING-RIGHT: 0px; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px">

  <DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>

  <DIV 

  style="FONT: 10pt arial; BACKGROUND: #e4e4e4; font-color: black"><B>From:</B> 

  <A title=a.hardie@lancaster.ac.uk 

  href="mailto:a.hardie@lancaster.ac.uk">Hardie, Andrew</A> </DIV>

  <DIV style="FONT: 10pt arial"><B>To:</B> <A title=cwb@sslmit.unibo.it 

  href="mailto:cwb@sslmit.unibo.it">Open source development of the Corpus 

  WorkBench</A> </DIV>

  <DIV style="FONT: 10pt arial"><B>Sent:</B> Friday, March 16, 2018 7:04 

PM</DIV>

  <DIV style="FONT: 10pt arial"><B>Subject:</B> Re: [CWB] Suggestion: user 

  intervention in constructing an index</DIV>

  <DIV><BR></DIV>

  <DIV class=WordSection1>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Hi</SPAN> 

  <SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Ciarán,<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">There 

  are two answers here… <o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">First, 

  it most certainly is already possible to adjust the form of the words as they 

  are indexed. Simply prepare a script to make the change and pipe your files 

  through it into the cwb-encode standard input (cwb-encode reads from standard 

  input if no files are specified).<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">(Or 

  just run your converter separately on the data to create a modified version, 

  and then index that, to avoid mucking about with pipes!)<o:p></o:p></SPAN></P>
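  <P class=MsoNormal><SPAN 
  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>
  <P class=MsoNormal><SPAN 
  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">A 
  rough sketch of such a converter, assuming Python, plain one-token-per-line 
  input with no XML tag lines, and the "^" and "+" markup described in the 
  original message. The function name convert is hypothetical, and keeping a 
  trailing "+" on non-final parts of a compound is only one possible convention, 
  not something CWB itself interprets. Each output row is tab-separated, with the 
  display form first and the index term second.<o:p></o:p></SPAN></P>
<PRE>
```python
import re
import sys

def convert(lines):
    """Turn marked-up tokens (one per line) into 'word TAB demut' rows."""
    for line in lines:
        token = line.strip()
        if not token:
            continue
        # "+" separates the parts of a compound; each part gets its own row.
        parts = token.split("+")
        for i, part in enumerate(parts):
            # Index term: drop "^" plus the following character, lower-case.
            demut = re.sub(r"\^.", "", part).lower()
            # Display form: drop only the "^" marker.
            word = part.replace("^", "")
            if i != len(parts) - 1:
                # Non-final part: keep "+" to flag "no space before next word".
                word += "+"
            yield word + "\t" + demut

# In a real pipe this would be convert(sys.stdin); a list stands in here.
sys.stdout.write("\n".join(convert(["Bean", "sean+b^hean", "^mbean"])) + "\n")
```
</PRE>
  <P class=MsoNormal><SPAN 
  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Reading 
  from sys.stdin instead of the list would let the script sit in a pipe in front 
  of cwb-encode, with the second column declared as an extra positional 
  attribute.<o:p></o:p></SPAN></P>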

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Second, 

  although that is the direct answer to your question, actually it is probably 

  not “the right thing” to do. What you are talking about here is effectively 

  lemmatisation – since <I>bean/bhean/mbean</I> are different forms of a single 

  lemma, converting them all to “bean” means lemmatising. So what you’re talking 

  about is indexing the lemma in place of the wordform. But the “right way” to 

  do this in CWB is to add the lemma as a separate attribute – allowing the 

  lemma to be queried, as well as / instead of the word.<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">This 
  means adding the lemma as a second column of the input file, like 
  this:<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Bean&nbsp;&nbsp; 

  bean<o:p></o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">(…)<o:p></o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">ar&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

  ar<o:p></o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">mbean 

  bean<o:p></o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">(…)<o:p></o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">mo&nbsp;&nbsp;&nbsp;&nbsp; 

  mo<o:p></o:p></SPAN></P>

  <P style="MARGIN-LEFT: 36pt" class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">bhean&nbsp; 

  bean<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">(and 

  likewise for plural forms of <I>bean</I>, etc etc.)<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">I 

  don’t know what lemmatisation tool is considered standard for Gaelic at the 

  moment, but I guess there must be options out there? 

  &nbsp;<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">You 

  can then do queries like this:<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">&nbsp;&nbsp;&nbsp;&nbsp; 

  [lemma="bean"];<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">… 

  to retrieve <I>bean/mbean/bhean</I> all at the same 

time.<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">The 

  advantage of encoding the lemma as a separate attribute is that the 

  concordance can <I>display</I> the actual form that appears in the 

  word-attribute, even if you have <I>searched</I> on the lemma-attribute. 

  Whereas if you replace the word forms, you don’t get 

  that.<o:p></o:p></SPAN></P>

  <P class=MsoNormal><I><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></I></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Hope 

  this helps!<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">best<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US">Andrew.<o:p></o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Verdana',sans-serif; COLOR: #1f497d; FONT-SIZE: 10pt; mso-fareast-language: EN-US"><o:p>&nbsp;</o:p></SPAN></P>

  <DIV>

  <DIV 

  style="BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0cm; PADDING-LEFT: 0cm; PADDING-RIGHT: 0cm; BORDER-TOP: #e1e1e1 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt">

  <P class=MsoNormal><B><SPAN 

  style="FONT-FAMILY: 'Calibri',sans-serif; FONT-SIZE: 11pt" 

  lang=EN-US>From:</SPAN></B><SPAN 

  style="FONT-FAMILY: 'Calibri',sans-serif; FONT-SIZE: 11pt" lang=EN-US> 

  cwb-bounces@sslmit.unibo.it [mailto:cwb-bounces@sslmit.unibo.it] <B>On Behalf 

  Of </B>Ciarán Ó Duibhín<BR><B>Sent:</B> 16 March 2018 18:18<BR><B>To:</B> 

  cwb@sslmit.unibo.it<BR><B>Subject:</B> [CWB] Suggestion: user intervention in 

  constructing an index<o:p></o:p></SPAN></P></DIV></DIV>

  <P class=MsoNormal><o:p>&nbsp;</o:p></P>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">I would like to 

  suggest/request a facility in CWB (or its successor) where a user can 

  intervene in the construction of an index.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">I envisage allowing 

  the user to supply a script which can receive the token, extracted from the 

  text and&nbsp;destined to be placed in an index, and can transform it.&nbsp; 

  The transformed&nbsp;token would be placed in the index, rather than the 

  original form.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">The attached 

  concordance output (tobar.jpg) — if attachments are allowed on the list 

  —&nbsp;was made by another program, and shows an example of why I need this 

  facility.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">In my example, under 

  the keyword "bean" are indexed/concorded several different forms, including 

  "bean" and "bhean" and "mbean" and "Bean", among others.&nbsp; As far as I am 

  aware, this cannot be achieved with CWB at 

present.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">In my texts, "bhean" 

  is marked up as "b^hean", and "mbean" as "^mbean".&nbsp; I would like to be 

  able to supply a script which, in my case,&nbsp;would drop the character "^" 

  and the letter immediately following it.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">In&nbsp;displayed 

  contexts, I would need to be able to drop the character "^" but retain the 

  letter following it.&nbsp; This is what happens in the program which produced 

  the screenshot.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">In my case again, I 

  would also make my script lower-case the token, bringing "Bean" into the 

  family.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">It would further be 

  necessary to allow the script to return more than one keyword.&nbsp; For 

  example, the text might contain "seanbhean", which I encode as 

  "sean+b^hean".&nbsp; My script here would act on the character "+" and return 

  TWO words for the index, "sean" and "bean".&nbsp; Contexts would show 

  "seanbhean", with "^" and "+" both deleted.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">For contexts, it 

  might suffice (for my needs)&nbsp;to give CWB a list of characters to be 

  dropped from contexts, without going to the lengths of allowing a user script 

  for contexts, in addition to the script for 

  keywords.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">With 

  thanks,</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal><SPAN 

  style="FONT-FAMILY: 'Arial',sans-serif; FONT-SIZE: 10pt">Ciarán Ó 

  Duibhín.</SPAN><o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV>

  <DIV>

  <P class=MsoNormal>&nbsp;<o:p></o:p></P></DIV></DIV>

  <P>

  <HR>


  <P></P>_______________________________________________<BR>CWB mailing 

  list<BR>CWB@sslmit.unibo.it<BR>http://liste.sslmit.unibo.it/mailman/listinfo/cwb<BR></BLOCKQUOTE></BODY></HTML>