<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
        {font-family:Wingdings;
        panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0cm;
        margin-right:0cm;
        margin-bottom:0cm;
        margin-left:36.0pt;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
p.msonormal0, li.msonormal0, div.msonormal0
        {mso-style-name:msonormal;
        mso-margin-top-alt:auto;
        margin-right:0cm;
        mso-margin-bottom-alt:auto;
        margin-left:0cm;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
span.EmailStyle18
        {mso-style-type:personal;
        font-family:"Verdana",sans-serif;
        color:#1F497D;
        font-weight:normal;
        font-style:normal;
        text-decoration:none none;}
span.EmailStyle21
        {mso-style-type:personal-reply;
        font-family:"Verdana",sans-serif;
        color:#1F497D;
        font-weight:normal;
        font-style:normal;
        text-decoration:none none;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:96758683;
        mso-list-type:hybrid;
        mso-list-template-ids:1228724376 1718550506 134807555 134807557 134807553 134807555 134807557 134807553 134807555 134807557;}
@list l0:level1
        {mso-level-start-at:0;
        mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Wingdings;
        mso-fareast-font-family:Calibri;
        mso-bidi-font-family:"Times New Roman";}
@list l0:level2
        {mso-level-number-format:bullet;
        mso-level-text:o;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:"Courier New";}
@list l0:level3
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Wingdings;}
@list l0:level4
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Symbol;}
@list l0:level5
        {mso-level-number-format:bullet;
        mso-level-text:o;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:"Courier New";}
@list l0:level6
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Wingdings;}
@list l0:level7
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Symbol;}
@list l0:level8
        {mso-level-number-format:bullet;
        mso-level-text:o;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:"Courier New";}
@list l0:level9
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Wingdings;}
ol
        {margin-bottom:0cm;}
ul
        {margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body bgcolor="white" lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">&gt;&gt;</span><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;"> I am not comfortable with the idea of storing two columns to hold
 things which (unlike&nbsp;with normal lemmatisation)&nbsp;can be automatically generated from one column —&nbsp;during the indexing process, if access by a&nbsp;user-supplied script were usable there, acting on the text shown in column 1&nbsp;to produce what is shown in column 2.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">But as I’ve explained, there is already a way to do that if you don’t want a permanent multi-column file – just put your user script
 into a pipeline with cwb-encode on the end. IE:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<ul style="margin-top:0cm" type="disc">
<li class="MsoListParagraph" style="color:#1F497D;margin-left:0cm;mso-list:l0 level1 lfo1">
<span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;mso-fareast-language:EN-US">cat one-col-file |&nbsp; column-transform-script | cwb-encode [options]<o:p></o:p></span></li></ul>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">&gt;&gt;
</span><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">I am unsure&nbsp;how the extra-column approach will handle the case where a single token of text is to be split into two index items (column 2),</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">In CWB, “index items”
<i>are</i> tokens, i.e. there is no distinction between these concepts. If you split a word across two lines, it’s now two tokens, not one, and a space between the two items will render in the concordance.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Some display systems like BNCweb remove non-original orthographic spaces from the CQP concordance. (BNCweb does this by having an additional
 binary p-attribute storing the “orthographic-space-after” data). &nbsp;BUT the items which are thus visually merged are still 2 tokens in the underlying index, and are counted/searched for as such. So rendering them as one visual unit actually makes the underlying
 data less understandable to the user.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Putting a &#43; after “sean” in the index will create issues, because instances of “sean&#43;” will be treated as distinct from instances of
 “sean” (so a search for “sean” would find only the latter). I don’t recommend this approach.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Generally speaking: It is a
<i>very</i> unusual step in tokenisation to split compound words into two tokens. It means that user would have to search for the compound as a sequence of two tokens – even if it renders as one. If you are the only user that might be what you want, but if
 there are other users, it will make the needed search patterns rather counter intuitive.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">EG, people would need to search for<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal" style="text-indent:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">[demut=&quot;sean&quot;] [demut=&quot;bean&quot;]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">instead of<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal" style="text-indent:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">[demut=&quot;seanbean&quot;]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">&gt;&gt;</span><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;"> I would still like the CWB developers to consider my request for
 the facility to execute&nbsp;a user-supplied script at these two points in the process.</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">You can use a pipeline in indexing to address the first point (of indexing), as I mentioned above, and you can address the second point
 (of rendering) by writing a display program which lays things out to your liking using one of the interface libraries i.e. the CWB-Perl modules or the cqp.inc.php module from CQPweb.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Or, if you prefer, just write your rendering script to pipe text in and out of a cqp slave instance (which is what the Perl and PHP
 libraries do behind the scenes). <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">best<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Andrew.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif"> cwb-bounces@sslmit.unibo.it [mailto:cwb-bounces@sslmit.unibo.it]
<b>On Behalf Of </b>Ciarán Ó Duibhín<br>
<b>Sent:</b> 20 March 2018 14:22<br>
<b>To:</b> Open source development of the Corpus WorkBench &lt;cwb@sslmit.unibo.it&gt;<br>
<b>Subject:</b> Re: [CWB] Suggestion: user intervention in constructing an index<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">Thanks, Andrew, for those constructive ideas.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">I have experimented with your second suggestion of adding a &quot;lemma&quot; column to the input.&nbsp; (For info,&nbsp;what is marked up&nbsp;in my text is *partial* lemmatisation, covering changes at the
 beginning of words,&nbsp;so I'll call it &quot;demut(ation)&quot; rather than &quot;lemma&quot;.&nbsp;&nbsp;Full lemmatisation&nbsp;would require attention to terminal inflection as well.)</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">So, I could generate extra columns, like this:</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b^hean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bhean</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ^mbean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mbean</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Bean</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">The first column is what is in the text;&nbsp;this column&nbsp;can be removed from the&nbsp;file&nbsp;when the other two have been generated from it.&nbsp; The second is the index term (&quot;demut&quot;).&nbsp; The third
 is what I want to see in contexts (&quot;word&quot;).</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">While this will work, I am not comfortable with the idea of storing two columns to hold things which (unlike&nbsp;with normal lemmatisation)&nbsp;can be automatically generated from one column
 —&nbsp;during the indexing process, if access by a&nbsp;user-supplied script were usable there, acting on the text shown in column 1&nbsp;to produce what is shown in column 2.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">Turning from the index keywords to the contexts, I am unsure&nbsp;how the extra-column approach will handle the case where a single token of text is to be split into two index items (column
 2), which should be displayed in context without any space between them.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sean&#43;b^hean&nbsp;&nbsp; sean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sean&#43;</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bean&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;bhean</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">Here I have used a &#43; sign at the end of an item in column 3, to show that I wish to have no space inserted in the context before the following word. Is there already a way of doing&nbsp;this&nbsp;in
 CWB?&nbsp; If not, access by a&nbsp;user-supplied script&nbsp;to the production of contexts could act on the text shown in column 1 to produce&nbsp;&quot;seanbhean&quot;.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">Software of my own gives proof of concept&nbsp;of&nbsp;processing text marked up as&nbsp;in column 1 above, allowing interpretation of the markup during both the extraction of indexing terms and
 the production of contexts,&nbsp;and I would still like the CWB developers to consider my request for the facility to execute&nbsp;a user-supplied script at these two points in the process.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">Many thanks again for your advice,</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Courier New&quot;">Ciarán.</span><o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid black 1.5pt;padding:0cm 0cm 0cm 4.0pt;margin-left:3.75pt;margin-top:5.0pt;margin-right:0cm;margin-bottom:5.0pt">
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">----- Original Message -----
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="background:#E4E4E4"><b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">From:</span></b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">
<a href="mailto:a.hardie@lancaster.ac.uk" title="a.hardie@lancaster.ac.uk">Hardie, Andrew</a>
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">To:</span></b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">
<a href="mailto:cwb@sslmit.unibo.it" title="cwb@sslmit.unibo.it">Open source development of the Corpus WorkBench</a>
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">Sent:</span></b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif"> Friday, March 16, 2018 7:04 PM<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">Subject:</span></b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif"> Re: [CWB] Suggestion: user intervention in constructing an index<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
</div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Hi</span>
<span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">
Ciarán,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">There are two answers here…
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">First, it most certainly is already possible to adjust the form of the words as they are indexed. Simply prepare a script to make the
 change and pipe your files through it into the cwb-encode standard input (cwb-encode reads from standard input if no files are specified).<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">(Or just run your converter separately on the data to create a modified version, and then index that, to avoid mucking about with pipes!)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Second, although that is the direct answer to your question, actually it is probably not “the right thing” to do. What you are talking
 about here is effectively lemmatisation – since <i>bean/bhean/mbean</i> are different forms of a single lemma, converting them all to “bean” means lemmatising. So what you’re talking about is indexing the lemma in place of the wordform. But the “right way”
 to do this in CWB is to add the lemma as a separate attribute – allowing the lemma to be queried, as well as / instead of the word.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">This means adding the lemma as a second column of the input file, like thus:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Bean&nbsp;&nbsp; bean<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">(…)<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">ar&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ar<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">mbean bean<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">(…)<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">mo&nbsp;&nbsp;&nbsp;&nbsp; mo<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">bhean&nbsp; bean<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">(and likewise for plural forms of
<i>bean</i>, etc etc.)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">I don’t know what lemmatisation tool is considered standard for Gaelic at the moment, but I guess there must be options out there?
 &nbsp;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">You can then do queries like this:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">&nbsp;&nbsp;&nbsp;&nbsp; [lemma=&quot;bean&quot;];<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">… to retrieve
<i>bean/mbean/bhean</i> all at the same time.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">The advantage of encoding the lemma as a separate attribute is that the concordance can
<i>display</i> the actual form that appears in the word-attribute, even if you have
<i>searched</i> on the lemma-attribute. Whereas if you replace the word forms, you don’t get that.<o:p></o:p></span></p>
<p class="MsoNormal"><i><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></i></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Hope this helps!<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">best<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Andrew.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p>&nbsp;</o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif">
<a href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a> [<a href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>]
<b>On Behalf Of </b>Ciarán Ó Duibhín<br>
<b>Sent:</b> 16 March 2018 18:18<br>
<b>To:</b> <a href="mailto:cwb@sslmit.unibo.it">cwb@sslmit.unibo.it</a><br>
<b>Subject:</b> [CWB] Suggestion: user intervention in constructing an index<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">I would like to suggest/request a facility in CWB (or its successor) where a user can intervene in the construction of an index.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">I envisage allowing the user to supply a script which can receive the token, extracted from the text and&nbsp;destined to be placed in an index, and can transform it.&nbsp; The transformed&nbsp;token
 would be placed in the index, rather than the original form.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">The attached concordance output (tobar.jpg) — if attachments are allowed on the list —&nbsp;was made by another program, and shows an example of why I need this facility.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">In my example, under the keyword &quot;bean&quot; are indexed/concorded several different forms, including &quot;bean&quot; and &quot;bhean&quot; and &quot;mbean&quot; and &quot;Bean&quot;, among others.&nbsp; As far as I am aware,
 this cannot be achieved with CWB at present.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">In my texts, &quot;bhean&quot; is marked up as &quot;b^hean&quot;, and &quot;mbean&quot; as &quot;^mbean&quot;.&nbsp; I would like to be able to supply a script which, in my case,&nbsp;would drop the character &quot;^&quot; and the letter
 immediately following it.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">In&nbsp;displayed contexts, I would need to be able to drop the character &quot;^h&quot; but retain the letter following it.&nbsp; This is what happens in the program which produced the screenshot.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">In my case again, I would also make my script lower-case the token, bringing &quot;Bean&quot; into the family.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">It would further be necessary to allow the script to return more than one keyword.&nbsp; For example, the text might contain &quot;seanbhean&quot;, which I encode as &quot;sean&#43;b^hean&quot;.&nbsp; My script
 here would act on the character &quot;&#43;&quot; and return TWO words for the index, &quot;sean&quot; and &quot;bean&quot;.&nbsp; Contexts would show &quot;seanbhean&quot;, with &quot;^&quot; and &quot;&#43;&quot; both deleted.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">For contexts, it might suffice (for my needs)&nbsp;to give CWB a list of characters to be dropped from contexts, without going to the lengths of allowing a user script for contexts,
 in addition to the script for keywords.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">With thanks,</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif">Ciarán Ó Duibhín.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">&nbsp;<o:p></o:p></p>
</div>
<div class="MsoNormal" align="center" style="text-align:center">
<hr size="2" width="100%" align="center">
</div>
<p class="MsoNormal">_______________________________________________<br>
CWB mailing list<br>
<a href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a><br>
<a href="http://liste.sslmit.unibo.it/mailman/listinfo/cwb">http://liste.sslmit.unibo.it/mailman/listinfo/cwb</a><o:p></o:p></p>
</blockquote>
</div>
</body>
</html>