[CWB] Display options for structural attributes

Lukas Michelbacher michells at ims.uni-stuttgart.de
Fri Sep 17 14:37:05 CEST 2010


Hello,

I was wondering if there is an easy way to output the value of an
S-attribute for each match.

As far as I know [1], when you display S-attributes, they are only shown at
the position where they actually occur in the corpus [2].

I'd like to be able to say something like "show +story:num" and then get the
value of the num attribute of the enclosing story tag for each hit.  This
could be useful for computing tf-idf weights, for example.  E.g. the query

> "A"

would yield the result

2: A/DT/a/1
11: A/DT/a/2

Otherwise, I'd have to encode the story number as a P-attribute for each
token, which would store redundant information and require more annoying
preprocessing ;).
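
To make the workaround concrete, the preprocessing could look roughly like
the Python sketch below.  It is only an illustration and not part of CWB: the
function name add_story_num and the regular expression are my own
assumptions, and it assumes that no token line starts with "<".

```python
import re

# Matches an opening <story ...> tag and captures the value of its num attribute.
STORY_OPEN = re.compile(r'<story\b[^>]*\bnum="([^"]*)"')


def add_story_num(lines):
    """Yield the input lines unchanged, except that every token line inside
    a <story> region gets the story's num appended as an extra
    tab-separated column.

    Assumption: any line starting with "<" is markup, never a token.
    """
    current = None  # num of the story we are currently inside, if any
    for line in lines:
        line = line.rstrip("\n")
        stripped = line.strip()
        if stripped.startswith("<"):
            match = STORY_OPEN.match(stripped)
            if match:
                current = match.group(1)
            elif stripped.startswith("</story"):
                current = None
            yield line
        elif stripped and current is not None:
            yield f"{line}\t{current}"
        else:
            yield line


if __name__ == "__main__":
    import sys

    # Reads the vertical text from stdin and writes the extended version to stdout.
    for out_line in add_story_num(sys.stdin):
        print(out_line)
```

The extra column would then have to be declared as one more P-attribute when
encoding (an additional -P option with a name of my choosing), which is
exactly the redundancy I'd like to avoid.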

Regards,

Lukas

--
Dipl.-Ling. Lukas Michelbacher
Institute for Natural Language Processing
University of Stuttgart

phone: +49 (0)711-685-84587
fax  : +49 (0)711-685-81366
email: michells at ims.uni-stuttgart.de

[1] My knowledge is based on
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CWBTutorial/cwb-tutorial.pdf

[2] This is my example corpus:

<!-- A Thrilling Experience -->
<story num="1" title="A Thrilling Experience">
<p>
<s>
Tick	NN	tick
.	SENT	.
</s>
<s>
A	DT	a
clock	NN	clock
.	SENT	.
</s>
<s>
Tick	VB	tick
,	,	,
tick	VB	tick
.	SENT	.
</s>
</p>
</story>

<story num="2" title="A Thrilling Experience 2">
<p>
<s>
Tock	NN	tock
.	SENT	.
</s>
<s>
A	DT	a
click	NN	click
.	SENT	.
</s>
<s>
Tock	VB	tock
,	,	,
tock	VB	tock
.	SENT	.
</s>
</p>
</story>

I encoded it with CWB-2.2.b99-RC1 using the following options:

-D -B -s -x -s -P pos -P lemma -S s:0 -S p:0 -V story:0+num+title

