<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:Monaco;
        panose-1:0 0 0 0 0 0 0 0 0 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
code
        {mso-style-priority:99;
        font-family:"Courier New";}
tt
        {mso-style-priority:99;
        font-family:"Courier New";}
p.msonormal0, li.msonormal0, div.msonormal0
        {mso-style-name:msonormal;
        mso-margin-top-alt:auto;
        margin-right:0cm;
        mso-margin-bottom-alt:auto;
        margin-left:0cm;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
span.EmailStyle20
        {mso-style-type:personal-reply;
        font-family:"Verdana",sans-serif;
        color:#1F497D;
        font-weight:normal;
        font-style:normal;
        text-decoration:none none;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Hi Maarten,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">>></span> there is a conceptually odd pattribute nbc in there<span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">This is in the 0.99 version, but not the 1.0 version – so yes, you’re right, it is odd and it has been removed! The 1.0 files are in
the repo here:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><a href="https://sourceforge.net/p/cwb/code/HEAD/tree/doc/corpora/dickens/release/">https://sourceforge.net/p/cwb/code/HEAD/tree/doc/corpora/dickens/release/</a>
<span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">>></span> what is the intended logic behind embedded sattributes?<o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">The intended logic is to make it possible to do
<i>something</i> with embedded XML elements using a system that was not designed for them.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">CWB predates XML. S-attributes were designed originally to represent non-overlapping, equivalent-status regions that divide up a text
sequentially. (Like sentences, paragraphs, chapters…) Each s-att is entirely separate and there is no expectation that the regions of one will pay any attention to the regions of another. Given this design it is very hard to re-tool the system to deal with
self-embedding, direct or indirect. <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">The current approach is a better-than-nothing attempt to deal with embedding by encoding embedded instances of an XML tag as separate
attributes, distinguished by the numbers appended to their names. So CQP doesn’t really “know” about these in any sense. As far as CQP is concerned,
<b>np1</b> and <b>np2</b> are as different as <b>text</b> and <b>chapter</b>. Hence there is no support for them in search.<o:p></o:p></span></p>
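<p class="MsoNormal">The renaming can be pictured with a small sketch. This is not the cwb-encode implementation, only an illustration of the naming scheme as described in this thread: the function name <code>opened_regions</code>, the <code>max_level</code> parameter and the decision to silently drop regions nested deeper than the limit are all assumptions made for the example.</p>

```python
import re

# Assumed VRT-like input: one tag or one token per line.
TAG = re.compile(r'<(/?)(\w+)([^>]*)>$')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def opened_regions(lines, max_level=2):
    """List the s-attribute regions opened by each start tag.

    Each element keeps its own nesting counter, so an <np> inside a
    <pp> inside an <np> is at level 1, not level 2.
    """
    depth = {}      # element name -> number of currently open instances
    regions = []    # (s-attribute name, annotation value or None)
    for line in lines:
        m = TAG.match(line.strip())
        if m is None:
            continue                      # token line, not a tag
        closing, name, rest = m.groups()
        if closing:
            depth[name] = depth.get(name, 1) - 1
            continue
        level = depth.get(name, 0)
        depth[name] = level + 1
        if level > max_level:
            continue                      # assumption: too deep, dropped
        suffix = str(level) if level else ""
        regions.append((name + suffix, None))
        # XML attributes become separate s-attributes: np_h, np_h1, ...
        for attr, value in ATTR.findall(rest):
            regions.append((f"{name}_{attr}{suffix}", value))
    return regions
```

<p class="MsoNormal">Run on the guessed fragment from the original message in isolation, the inner np comes out as np1, np_h1 and np_len1, while the pp stays at level 0 because no other pp encloses it. Nothing in the result links np1 back to np, which is the point made above.</p>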
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">>></span><np h=“ironmongery”> seems more intuitive than <np_h=“ironmongery”> given that the latter is not properly XML<span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">The whole thing predates XML. Again, this is a way of expressing XML-style attributes within the constraints of an architecture not
designed to do that – by the creation of extra attributes. CQP does not “know” that
<b>np</b> has anything to do with <b>np_h</b>. The syntax suggestion would definitely be more intuitive but it requires a different data structure than the one we’ve got.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">You’ll be glad to know that one of our major goals is to include support for actual full XML tree structures in the new data engine.
Stefan and I concluded 3 or 4 years ago that there was no way to add this via expanding the existing s-attribute model, so it will mean a from-scratch data architecture and a new attribute type.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">best<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Andrew.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif"> cwb-bounces@sslmit.unibo.it <cwb-bounces@sslmit.unibo.it>
<b>On Behalf Of </b>Maarten Janssen<br>
<b>Sent:</b> 24 April 2019 12:44<br>
<b>To:</b> cwb@sslmit.unibo.it<br>
<b>Subject:</b> [CWB] Embedded sattributes<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">Hi,<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">When attempting to write a (conceptually odd) script to convert a compiled CQP corpus to a TEITOK corpus (from which you can then in turn create a CQP corpus again, potentially after editing), I noticed two strange things when looking into
the DICKENS example corpus that I used to test the script, and maybe somebody can clarify them for me.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The first is that there is a conceptually odd pattribute nbc in there, specifying which chapter of which novel a token belongs to. With that, you can search for a:[nbc="<span style="font-family:"Monaco",serif">A Christmas Carol, Ch. 1</span>"]
to only find words from that specific chapter. But why is that there? Am I missing something or does this not do exactly the same, while being much cleaner: a:[] :: a.novel_title="A Christmas Carol" & a.<span style="font-family:"Monaco",serif">chapter_num</span>="1"<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The second is more tricky, and has to do with embedded sattributes and how they work - which is never trivial since despite sattributes in principle just being regions, which could happily overlap, CQP somehow ignores all embedded attributes
completely - it would be difficult to get overlapping or embedded regions from a VRT file, but even writing CQP files directly, the searches completely overlook them. What is mentioned in the encoding PDF about embedded xml attributes is this:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal"> If you want to preserve nested elements, you can specify a maximal level of embedding instead of
<tt><span style="font-size:10.0pt">:0</span></tt> in the examples above. For instance,
<code><span style="font-size:10.0pt">-S table:2</span></code> allows two levels of embedding for
<code><span style="font-size:10.0pt"><table></span></code> elements. Nested elements are automatically renamed to
<code><span style="font-size:10.0pt"><table1></span></code> and <code><span style="font-size:10.0pt"><table2></span></code>, respectively, and stored in separate s-attributes. <o:p></o:p></p>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Looking at the dickens example corpus, these embedded sattributes are not treated like normal attributes, since the <np> in dickens apparently has 2 embedding levels (not specified as np:2 in the registry file, it just lists the renamed
structures), since the whole <np> block is together:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal"># <np h=".." len=".."> ... </np><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"># (2 levels of embedding: <np>, <np1>, <np2>)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np1<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np2<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np_h # [annotations]<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np_h1 # [annotations]<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np_h2 # [annotations]<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np_len # [annotations]<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np_len1 # [annotations]<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">STRUCTURE np_len2 # [annotations]<o:p></o:p></p>
</div>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">And the surprising thing is that the renaming is not total: the head of np1 is not called np1_h, but rather np_h1 - which I noticed since that makes it a lot more difficult to get back to the supposed vrt format given that you have to
handle those explicitly (does that imply numbers are not allowed at the end of sattributes?). So that makes you hope there is some fancy treatment of them in the search - but that does not seem to be the case. So either I am missing something, or the treatment of embedded
sattributes makes things more difficult rather than easier. Let me clarify.<o:p></o:p></p>
</div>
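<p class="MsoNormal">For the conversion back to a vrt-like format, the name splitting could be sketched as follows. This is only an illustration under stated assumptions: the helper <code>split_sattr</code> is hypothetical, the set of base element names is assumed to be known in advance (for example from the registry entries without a numeric suffix), and attribute names are assumed not to end in a digit, since otherwise np_h1 would be ambiguous.</p>

```python
import re


def split_sattr(name, elements):
    """Split an s-attribute name into (element, attribute, level).

    'np_h1' -> ('np', 'h', 1); 'np2' -> ('np', None, 2).
    Returns None if the name does not belong to any known element.
    """
    # Try longer element names first so that a short name cannot
    # claim a longer one that merely starts with the same letters.
    for elem in sorted(elements, key=len, reverse=True):
        pattern = re.escape(elem) + r"(?:_(\w+?))?(\d*)"
        m = re.fullmatch(pattern, name)
        if m:
            attr, digits = m.groups()
            return elem, attr, int(digits) if digits else 0
    return None
```

<p class="MsoNormal">With that, np_h1 and np1 both resolve to level 1 of np, so a start tag with its attributes can be reassembled per level. A name such as np1_h is not recognised, in line with the naming observed in the registry file.</p>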
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Since there is no vrt file for dickens, I have to assume what the input might look like, but I assume this (is there btw any option to make cwb-decode produce this type of output? -Cx does not do attributes… I now just rework the output
in a script, but there might be complexities I overlook):<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"><np h="ironmongery" len="4"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">ironmongery NN ironmongery A Christmas Carol, Ch. 1<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"><pp h="in" len="3"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">in IN in A Christmas Carol, Ch. 1<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"><np h="trade" len="2"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">the DT the A Christmas Carol, Ch. 1<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">trade NN trade A Christmas Carol, Ch. 1<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"></np><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"></pp><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"></np><o:p></o:p></span></p>
</div>
</div>
</blockquote>
<div>
<p class="MsoNormal"><br>
<br>
<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Given that there are embedded <np> here, they get renamed to <np1> and <np2>, which makes it possible to have both nps in the corpus - since otherwise the second one would get ignored even when added to the corpus. And they hence get a
special treatment, being renamed np_h1 instead of np1_h as mentioned before - a special treatment that makes a slightly modified CQP syntax that I think I heard mentioned much more difficult: <np h=“ironmongery”> seems more intuitive than <np_h=“ironmongery”> given
that the latter is not properly XML - but <np1 h=“ironmongery”> would hence not work and <np h1=“ironmongery”> seems even more odd than np_h1. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Now the way that is encoded seems - at least at face value - to make it not so much easier to use <np>, but more difficult, since you first have to know how the system happened to name them; so you cannot just look for nps with the head
“ironmongery”, since you have to specify it is embedded at level 1:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">DICKENS> a:[word="ironmongery"] :: a.np_h="ironmongery"; <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">0 matches.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif">DICKENS> a:[word="ironmongery"] :: a.np_h1="ironmongery"; <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Monaco",serif"> 194: l as the deadest piece of <<span style="color:white;background:black">ironmongery</span>> in the trade . But the w<o:p></o:p></span></p>
</div>
</div>
</blockquote>
<div>
<p class="MsoNormal"><br>
<br>
<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Also notice that the treatment does not seem uniform: there are occurrences of <np> in the corpus, so you would expect those to be related to non-embedded cases; but the <pp> in this example is not embedded at all, and is still named <pp1>.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><br>
<br>
<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">So to get back to the actual question: what is the intended logic behind embedded sattributes?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Maarten<o:p></o:p></p>
</div>
</div>
</div>
</body>
</html>