[CWB] Embedded sattributes

Wed Apr 24 15:46:04 CEST 2019

Hi Maarten!

> Right - then it would prob. be good to update the link here, since that is where I got the corpus today, and which is where most other people are also likely to get it, and which indeed points to version 0.99: 
> 
> http://cwb.sourceforge.net/download.php#corpora

All part of overdue updates to release packages, documentation and the Web page. :-(

>> The current approach is a better-than-nothing attempt to deal with embedding by encoding embedded instances of an XML tag onto the separate attributes created by the numbers. So CQP doesn't really "know" about these in any sense. np1 and np2 are as different as text and chapter. Thus no support in search.
> 
> Understood. But why are they then simple np1_h ? I see that makes no difference for the system, but it seems the more logical naming somehow…

Due to the way the automatic renaming is implemented in cwb-encode, it was easier to generate attribute names like np_h1 than it was to generate np1_h …

Since there is no syntactic sugar for querying embedded attributes, you have to know the convention anyway, so it didn't seem to make much of a difference.

For your CWB-to-TEI script, I'd suggest the following heuristic:

	1. consider only s-attributes that don't end in a digit (i.e. cannot be embedded) – note that the naming convention actually makes this step easier

	2. identify automatically re-named XML-attributes, e.g. <np>, <np_h>, <np_len> => <np h="…" len="…">; you should also check that the actual regions are identical (or at least that you have the same number of regions) in order to avoid false positives

	3. now scan for embedded regions by appending 1, 2, … to all attribute names; each embedding level should have entries for all s-attributes in a group
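
A rough Python sketch of the name-based part of this heuristic, in case it helps as a starting point. It works only on a list of s-attribute names and does not compare the actual regions (the check suggested in step 2), so that still has to be done against the corpus. The function name and the shape of the returned dictionary are made up for illustration.

```python
def group_s_attributes(names):
    """Group s-attribute names into XML elements, their attributes and embedding levels.

    Purely name-based: region identity (step 2) is NOT verified here.
    """
    names = set(names)
    # Step 1: only names that do not end in a digit can be non-embedded attributes.
    base = [n for n in names if n and not n[-1].isdigit()]
    # Step 2: a base name of the form "<other base name>_<suffix>" is assumed
    # to be a re-named XML attribute; everything else is treated as an element.
    elements = [
        n for n in base
        if not any(n != p and n.startswith(p + "_") for p in base)
    ]
    groups = {}
    for elem in sorted(elements):
        attrs = sorted(n for n in base if n.startswith(elem + "_"))
        # Step 3: append 1, 2, ... until no s-attribute of that name exists.
        levels = []
        depth = 1
        while f"{elem}{depth}" in names:
            levels.append({
                "element": f"{elem}{depth}",
                "attributes": [f"{a}{depth}" for a in attrs if f"{a}{depth}" in names],
            })
            depth += 1
        groups[elem] = {"attributes": attrs, "levels": levels}
    return groups


if __name__ == "__main__":
    # Hypothetical attribute list, following the np / np_h / np_h1 convention.
    example = ["np", "np_h", "np_len", "np1", "np_h1", "np_len1", "pp", "pp1", "s"]
    for elem, info in group_s_attributes(example).items():
        print(elem, info)
```

An element whose own name contains an underscore and happens to extend another element's name would be misclassified as an XML attribute here, which is one more reason to compare the regions before trusting a group.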

> Now the way that is encoded seems - at least at face value - to make it not so much easier to use <np>, but more difficult, since you first have to know how the system happened to name them;

Yes. As Andrew explained, it is a better-than-nothing solution for supporting XML annotation without having to re-write large parts of CWB.

> so you cannot just look for nps with the head “ironmongery”, since you have to specify it is embedded at level 1:
> 
>> DICKENS> a:[word="ironmongery"] :: a.np_h="ironmongery"; 
>> 0 matches.
>> DICKENS> a:[word="ironmongery"] :: a.np_h1="ironmongery"; 
>>       194: l as the deadest piece of <ironmongery> in the trade . But the w

The CQP tutorial should have some recommendations on how to work with embedded regions. In this case, you need to write the query as

	 a:[word="ironmongery"] :: a.np_h="ironmongery" | a.np_h1="ironmongery" | a.np_h2="ironmongery"; 

Unfortunately, there is no easy way of providing syntactic sugar for this construct.
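
If you generate queries from a script, the disjunction can at least be built mechanically outside CQP. A minimal Python sketch, assuming the np_h, np_h1, np_h2 naming convention described above; the function names and the max_depth parameter are invented for illustration, and max_depth should match the number of embedding levels actually present in your corpus. Values containing double quotes are not escaped.

```python
def embedded_constraint(label, attribute, value, max_depth=2):
    """Build 'a.np_h="x" | a.np_h1="x" | ...' for levels 0..max_depth."""
    names = [attribute] + [f"{attribute}{d}" for d in range(1, max_depth + 1)]
    return " | ".join(f'{label}.{name}="{value}"' for name in names)


def embedded_query(word, attribute, value, label="a", max_depth=2):
    """Wrap the constraint in a labelled single-token query with a global constraint."""
    constraint = embedded_constraint(label, attribute, value, max_depth)
    return f'{label}:[word="{word}"] :: {constraint};'


if __name__ == "__main__":
    print(embedded_query("ironmongery", "np_h", "ironmongery"))
```

With the default max_depth=2 this produces the same query text as the one written out by hand above.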

> Also notice that the treatment does not seem uniform: there are occurrences of <np> in the corpus, so you would expect those to be related to non-embedded cases; but the <pp> in this example is not embedded at all, and still named <pp1>.

That's not the case (or else it must be a bug in v0.99 that has been fixed in v1.0): 

> DICKENS> show +np +np1 +pp +pp1
> DICKENS> cat
>       194:  <np>I</np> might have been inclined , <np><np1>myself</np1> ,</np> to regard <np>a coffin-nail</np> as <np>the deadest piece <pp>of <<np1>ironmongery> <pp1>in the trade</pp1></np1></pp></np> .

Of course, the attachment made by the chunker is wrong …

Best,
Stefan