[CWB] Embedded sattributes

Wed Apr 24 13:43:51 CEST 2019

Hi,

When attempting to write a (conceptually odd) script to convert a compiled CQP corpus to a TEITOK corpus (from which you can then in turn create a CQP corpus again, potentially after editing), I noticed two strange things when looking into the DICKENS example corpus that I used to test the script, and maybe somebody can clarify them for me.

The first is that there is a conceptually odd pattribute nbc in there, specifying which chapter of which novel a token belongs to. With that, you can search for a:[nbc="A Christmas Carol, Ch. 1"] to only find words from that specific chapter. But why is that there? Am I missing something, or does this not do exactly the same while being much cleaner: a:[] :: a.novel_title="A Christmas Carol" & a.chapter_num="1"

The second is more tricky, and has to do with embedded sattributes and how they work, which is never trivial: although sattributes are in principle just regions, which could happily overlap, CQP somehow ignores all embedded attributes completely. It would be difficult to get overlapping or embedded regions from a VRT file, but even when writing the CQP files directly, the searches completely overlook them. What the encoding PDF says about embedded XML attributes is this:

>   If you want to preserve nested elements, you can specify a maximal level of embedding instead of :0 in the examples above. For instance, -S table:2 allows two levels of embedding for <table> elements. Nested elements are automatically renamed to <table1> and <table2>, respectively, and stored in separate s-attributes.  

Looking at the dickens example corpus, these embedded sattributes are not treated like normal attributes, since the <np> in dickens apparently has 2 embedding levels (not specified as np:2 in the registry file, it just lists the renamed structures), since the whole <np> block is together:

> # <np h=".." len=".."> ... </np>
> # (2 levels of embedding: <np>, <np1>, <np2>)
> STRUCTURE np
> STRUCTURE np1
> STRUCTURE np2
> STRUCTURE np_h                 # [annotations]
> STRUCTURE np_h1                # [annotations]
> STRUCTURE np_h2                # [annotations]
> STRUCTURE np_len               # [annotations]
> STRUCTURE np_len1              # [annotations]
> STRUCTURE np_len2              # [annotations]

And the surprising thing is that the renaming is not total: the head of np1 is not called np1_h, but rather np_h1. I noticed this because it makes it a lot more difficult to get back to the supposed vrt format, given that you have to deal with those names explicitly (does that imply numbers are not allowed at the end of sattributes?). So that makes you hope there is some fancy treatment of them in the search, but that seems not to be the case. So either I am missing something, or the treatment of embedded sattributes makes things more difficult rather than easier. Let me clarify.
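
To make the naming issue concrete, here is a minimal Python sketch of how a conversion script might split the registry names back into element, attribute and nesting level. It rests on assumptions drawn only from the registry excerpt above: that a trailing run of digits is the nesting level, and that the first underscore separates element from attribute. The function name split_structure is made up for illustration.

```python
import re

# Assumption (from the registry excerpt only): trailing digits are the
# nesting level added at encoding time, and the first underscore separates
# the element name from the attribute name.
NAME = re.compile(r"^(?P<base>.*?)(?P<level>\d*)$")

def split_structure(name):
    """Split e.g. 'np_h1' into ('np', 'h', 1) and 'np2' into ('np', None, 2)."""
    m = NAME.match(name)
    base, level = m.group("base"), m.group("level")
    element, _, attribute = base.partition("_")
    return element, (attribute or None), (int(level) if level else 0)

for name in ["np", "np1", "np2", "np_h", "np_h1", "np_h2", "np_len2"]:
    print(name, split_structure(name))
```

Under these assumptions, an element or attribute name that really ends in a digit cannot be told apart from a nesting level, which is why the question about trailing numbers matters for the conversion.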

Since there is no vrt file for dickens, I have to guess what the input might have looked like, and I assume something like this (is there, by the way, any option to make cwb-decode produce this type of output? -Cx does not do attributes… I now just rework the output in a script, but there might be complexities I overlook):

> <np h="ironmongery" len="4">
> ironmongery     NN      ironmongery     A Christmas Carol, Ch. 1
> <pp h="in" len="3">
> in      IN      in      A Christmas Carol, Ch. 1
> <np h="trade" len="2">
> the     DT      the     A Christmas Carol, Ch. 1
> trade   NN      trade   A Christmas Carol, Ch. 1
> </np>
> </pp>
> </np>

Given that there are embedded <np> here, they get renamed to <np1> and <np2>, which makes it possible to have both nps in the corpus, since otherwise the second one would get ignored even when added to the corpus. And their attributes hence get the special treatment of being renamed np_h1 instead of np1_h, as mentioned before. That special treatment makes a slightly modified CQP syntax, which I think I heard mentioned, much more difficult: <np h="ironmongery"> seems more intuitive than <np_h="ironmongery">, given that the latter is not proper XML, but <np1 h="ironmongery"> would then not work, and <np h1="ironmongery"> seems even more odd than np_h1.
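
As a rough illustration of the renaming as I understand it from the manual passage quoted earlier, the following Python sketch walks over VRT-like lines and reports which s-attribute names each opening tag would end up in. It assumes the level is counted per element name (an <np> inside a <pp> inside an <np> counts as level 1 for np) and that attribute names take the level as a suffix, as in the registry excerpt. It is not cwb-encode's actual code, and region_names is a hypothetical helper.

```python
import re
from collections import defaultdict

START = re.compile(r'^<(\w+)((?:\s+\w+="[^"]*")*)\s*>$')
END = re.compile(r'^</(\w+)>$')
ATTR = re.compile(r'(\w+)="([^"]*)"')

def region_names(vrt_lines):
    """For each opening tag, list the s-attribute names it would map to.

    Assumption: nesting depth is tracked separately for each element name.
    """
    depth = defaultdict(int)  # current nesting depth per element name
    result = []
    for line in vrt_lines:
        line = line.strip()
        m = START.match(line)
        if m:
            elem, attrs = m.group(1), m.group(2)
            level = depth[elem]
            suffix = str(level) if level else ""
            names = [elem + suffix]
            names += [f"{elem}_{key}{suffix}" for key, _ in ATTR.findall(attrs)]
            result.append(names)
            depth[elem] += 1
            continue
        m = END.match(line)
        if m:
            depth[m.group(1)] -= 1
    return result

sample = [
    '<np h="ironmongery" len="4">',
    "ironmongery\tNN\tironmongery",
    '<pp h="in" len="3">',
    "in\tIN\tin",
    '<np h="trade" len="2">',
    "the\tDT\tthe",
    "trade\tNN\ttrade",
    "</np>",
    "</pp>",
    "</np>",
]
for names in region_names(sample):
    print(names)
```

For the assumed input this gives np/np_h/np_len for the outer <np>, pp/pp_h/pp_len for the <pp>, and np1/np_h1/np_len1 for the inner <np>. That is only what the assumption predicts, not necessarily what the DICKENS corpus contains.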

Now the way that is encoded seems - at least at face value - to make it not so much easier to use <np>, but more difficult, since you first have to know how the system happened to name them; so you cannot just look for nps with the head “ironmongery”, since you have to specify it is embedded at level 1:

> DICKENS> a:[word="ironmongery"] :: a.np_h="ironmongery"; 
> 0 matches.
> DICKENS> a:[word="ironmongery"] :: a.np_h1="ironmongery"; 
>       194: l as the deadest piece of <ironmongery> in the trade . But the w

Also notice that the treatment does not seem uniform: there are occurrences of <np> in the corpus, so you would expect those to be the non-embedded cases; but the <pp> in this example is not embedded at all, and is still named <pp1>.
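
Since the level has to be named explicitly in the query, one workaround would be to generate a constraint that tries every level. The Python sketch below only builds the query string. It assumes that CQP accepts a disjunction with | in the global constraint and that two levels of embedding are present, and I have not run the resulting query against DICKENS. The helper names are made up, and values containing quotes or regex metacharacters are not escaped.

```python
def any_level_constraint(label, element, attribute, value, max_level=2):
    """Build 'a.np_h="x" | a.np_h1="x" | ...' for levels 0..max_level."""
    suffixes = [""] + [str(i) for i in range(1, max_level + 1)]
    names = [f"{element}_{attribute}{s}" for s in suffixes]
    return " | ".join(f'{label}.{name}="{value}"' for name in names)

def head_query(word, max_level=2):
    """CQP query for a token that is the head of an np at any level."""
    constraint = any_level_constraint("a", "np", "h", word, max_level)
    return f'a:[word="{word}"] :: {constraint};'

print(head_query("ironmongery"))
```

This keeps the knowledge about the renamed attributes in one place, but it does not remove the need to know how many levels the corpus was encoded with.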

So to get back to the actual question: what is the intended logic behind embedded sattributes?

Maarten