[CWB] Embedded sattributes

Wed Apr 24 14:48:21 CEST 2019

Hi Maarten,

>> there is a conceptually odd pattribute nbc in there

This is in the 0.99 version, but not the 1.0 version – so yes, you’re right, it is odd and it has been removed! The 1.0 files are in the repo here:

https://sourceforge.net/p/cwb/code/HEAD/tree/doc/corpora/dickens/release/

>> what is the intended logic behind embedded sattributes?

The intended logic is to make it possible to do something with embedded XML elements using a system that was not designed for them.

CWB predates XML. S-attributes were designed originally to represent non-overlapping, equivalent-status regions that divide up a text sequentially. (Like sentences, paragraphs, chapters…) Each s-att is entirely separate and there is no expectation that the regions of one will pay any attention to the regions of another. Given this design it is very hard to re-tool the system to deal with self-embedding, direct or indirect.

The current approach is a better-than-nothing attempt to deal with embedding by encoding embedded instances of an XML tag as separate attributes, distinguished by the numbers. So CQP doesn’t really “know” about these in any sense. np1 and np2 are as different as text and chapter. Thus there is no support in search.

>><np h=“ironmongery”> seems more intuitive than <np_h=“ironmongery”> given that the latter is not properly XML

The whole thing predates XML. Again, this is a way of expressing XML-style attributes within the constraints of an architecture not designed to do that – by the creation of extra attributes. CQP does not “know” that np has anything to do with np_h. The syntax suggestion would definitely be more intuitive but it requires a different data structure than the one we’ve got.

You’ll be glad to know that one of our major goals is to include support for actual full XML tree structures in the new data engine. Stefan and I concluded 3 or 4 years ago that there was no way to add this via expanding the existing s-attribute model, so it will mean a from-scratch data architecture and a new attribute type.

best

Andrew.

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Maarten Janssen
Sent: 24 April 2019 12:44
To: cwb at sslmit.unibo.it
Subject: [CWB] Embedded sattributes

Hi,

When attempting to write a (conceptually odd) script to convert a compiled CQP corpus to a TEITOK corpus (from which you can then in turn create a CQP corpus again, potentially after editing), I noticed two strange things when looking into the DICKENS example corpus that I used to test the script, and maybe somebody can clarify them for me.

The first is that there is a conceptually odd pattribute nbc in there, specifying which chapter of which novel a token belongs to. With that, you can search for a:[nbc=“A Christmas Carol, Ch. 1”] to only find words from that specific chapter. But why is that there? Am I missing something, or does this not do exactly the same while being much cleaner: a:[] :: a.novel_title=“A Christmas Carol” & chapter_num=“1”

The second is trickier and has to do with embedded sattributes and how they work, which is never trivial: although sattributes are in principle just regions, which could happily overlap, CQP somehow ignores all embedded attributes completely. It would be difficult to get overlapping or embedded regions from a VRT file, but even when writing CQP files directly, the searches completely overlook them. What is mentioned in the encoding PDF about embedded XML attributes is this:

  If you want to preserve nested elements, you can specify a maximal level of embedding instead of :0 in the examples above. For instance, -S table:2 allows two levels of embedding for <table> elements. Nested elements are automatically renamed to <table1> and <table2>, respectively, and stored in separate s-attributes.

Looking at the dickens example corpus, these embedded sattributes are not treated like normal attributes: the <np> in dickens apparently has 2 embedding levels (not specified as np:2 in the registry file, which just lists the renamed structures), and the whole <np> block is listed together:

# <np h=".." len=".."> ... </np>
# (2 levels of embedding: <np>, <np1>, <np2>)
STRUCTURE np
STRUCTURE np1
STRUCTURE np2
STRUCTURE np_h                 # [annotations]
STRUCTURE np_h1                # [annotations]
STRUCTURE np_h2                # [annotations]
STRUCTURE np_len               # [annotations]
STRUCTURE np_len1              # [annotations]
STRUCTURE np_len2              # [annotations]

And the surprising thing is that the renaming is not total: the head of np1 is not called np1_h, but rather np_h1. I noticed this because it makes it a lot more difficult to get back to the supposed vrt format, given that you have to deal with those names explicitly (does that imply numbers are not allowed at the end of sattributes?). So that makes you hope there is some fancy treatment of them in the search, but that seems not to be the case. So either I am missing something, or the treatment of embedded sattributes makes things more difficult rather than easier. Let me clarify.
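The naming pattern in the registry listing above can be sketched as a pair of small helpers, purely as an illustration for a conversion script. The function names sattr_name and parse_sattr_name are hypothetical and not part of CWB. The sketch assumes that element names contain no underscore and that neither element nor attribute names end in a digit, which nothing in this thread confirms.

```python
def sattr_name(element, level=0, attr=None):
    """Build the s-attribute name for an element at a given embedding level.

    Follows the pattern seen in the DICKENS registry: the level number goes
    at the very end, after the XML attribute name (np_h1, not np1_h).
    """
    base = element if attr is None else f"{element}_{attr}"
    return base if level == 0 else f"{base}{level}"


def parse_sattr_name(name):
    """Split a name such as 'np_h1' into (element, attr, level).

    Assumption: trailing digits are always the embedding level, and the
    first underscore separates the element from its XML attribute.
    """
    base = name.rstrip("0123456789")
    level = int(name[len(base):] or 0)
    element, _, attr = base.partition("_")
    return element, (attr or None), level


# Rebuild the nine STRUCTURE names listed for <np h=".." len="..">.
names = [sattr_name("np", lvl, attr)
         for attr in (None, "h", "len")
         for lvl in range(3)]
print(names)
print(parse_sattr_name("np_len2"))
```

With the level recovered this way, a script can emit <np h=".." len=".."> for any of np, np1 or np2, provided the two naming assumptions hold for the corpus at hand.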

Since there is no vrt file for dickens, I have to assume what the input might look like, but I assume this (is there btw any option to make cwb-decode produce this type of output? -Cx does not do attributes… I now just rework the output in a script, but there might be complexities I overlook):

<np h="ironmongery" len="4">
ironmongery     NN      ironmongery     A Christmas Carol, Ch. 1
<pp h="in" len="3">
in      IN      in      A Christmas Carol, Ch. 1
<np h="trade" len="2">
the     DT      the     A Christmas Carol, Ch. 1
trade   NN      trade   A Christmas Carol, Ch. 1
</np>
</pp>
</np>

Given that there are embedded <np> here, they get renamed to <np1> and <np2>, which makes it possible to have both nps in the corpus, since otherwise the second one would get ignored even when added to the corpus. They hence get the special treatment mentioned before, being renamed np_h1 instead of np1_h. That special treatment makes a slightly modified CQP syntax that I think I heard mentioned much more difficult: <np h=“ironmongery”> seems more intuitive than <np_h=“ironmongery”> given that the latter is not properly XML, but <np1 h=“ironmongery”> would hence not work and <np h1=“ironmongery”> seems even odder than np_h1.
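As a rough sketch of the renaming described in the encoding PDF, the following counts the nesting depth separately for each element name and appends the level number to embedded instances. The function rename_nested is hypothetical and is not the encoder's actual code; skipping elements nested deeper than the maximum level is an assumption. Applied to the assumed input above in isolation, it leaves the outer <np> and the <pp> unnumbered, so the np_h1 match in the query results further down suggests that in the real corpus this noun phrase sits inside another <np>.

```python
def rename_nested(tags, max_level=2):
    """Map a stream of start/end tags to level-numbered names.

    `tags` is a list of strings such as "np" (start tag) or "/np" (end tag).
    Depth is tracked per element name, so a <pp> inside an <np> does not
    count as an embedded <np>.
    """
    depth = {}      # element name -> number of currently open instances
    renamed = []
    for tag in tags:
        closing = tag.startswith("/")
        name = tag.lstrip("/")
        if closing:
            depth[name] -= 1
            level = depth[name]
        else:
            level = depth.get(name, 0)
            depth[name] = level + 1
        if level > max_level:
            continue  # assumption: instances beyond the maximum are dropped
        suffix = "" if level == 0 else str(level)
        renamed.append(("/" if closing else "") + name + suffix)
    return renamed


# Tag skeleton of the assumed vrt input for "ironmongery in the trade".
print(rename_nested(["np", "pp", "np", "/np", "/pp", "/np"]))
# prints ['np', 'pp', 'np1', '/np1', '/pp', '/np']
```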

Now the way that is encoded seems - at least at face value - to make it not so much easier to use <np>, but more difficult, since you first have to know how the system happened to name them; so you cannot just look for nps with the head “ironmongery”, since you have to specify it is embedded at level 1:

DICKENS> a:[word="ironmongery"] :: a.np_h="ironmongery";
0 matches.
DICKENS> a:[word="ironmongery"] :: a.np_h1="ironmongery";
      194: l as the deadest piece of <ironmongery> in the trade . But the w
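One possible workaround, given that the level has to be named explicitly, is to generate a constraint that tries every level in turn. The helper below is a hypothetical sketch that only builds the query string. It has not been run against CQP, and it assumes both that the global constraint accepts alternatives joined with | and that a token lying outside any region at a given level simply fails that comparison.

```python
def any_level_constraint(label, element, attr, value, max_level=2):
    """Build a CQP constraint matching `attr` at any embedding level.

    Produces one comparison per level-numbered s-attribute (np_h, np_h1,
    np_h2 for max_level=2) and joins them with "|".
    """
    names = [f"{element}_{attr}" + (str(n) if n else "")
             for n in range(max_level + 1)]
    return " | ".join(f'{label}.{name}="{value}"' for name in names)


constraint = any_level_constraint("a", "np", "h", "ironmongery")
query = f'a:[word="ironmongery"] :: {constraint};'
print(query)
# prints a:[word="ironmongery"] :: a.np_h="ironmongery" | a.np_h1="ironmongery" | a.np_h2="ironmongery";
```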

Also notice that the treatment does not seem uniform: there are occurrences of <np> in the corpus, so you would expect those to be related to non-embedded cases; but the <pp> in this example is not embedded at all, and is still named <pp1>.

So to get back to the actual question: what is the intended logic behind embedded sattributes?

Maarten