[CWB] Display options for structural attributes
Lukas Michelbacher
michells at ims.uni-stuttgart.de
Fri Sep 17 14:37:05 CEST 2010
Hello,
I was wondering if there is an easy way to output the values of S-attributes
for each match.
As far as I know [1], when you display S-attributes, they are only shown inline,
at the corpus positions where the tags actually occur [2].
I'd like to be able to say something like "show +story:num" and then get the value
of the num attribute of the story tag for each hit. This could be useful for
computing tf-idf weights, for example. E.g., with such an option the query
> "A"
would ideally yield the result
2: A/DT/a/1
11: A/DT/a/2
Otherwise, I'd have to encode the story number as a P-attribute for each
token, which would store redundant information and require more annoying
preprocessing ;).
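To make the request concrete, here is a minimal sketch of the desired behaviour emulated outside of CQP. It is not a CWB feature or API: the function name matches_with_story_num and the regular expression are hypothetical, and the script assumes whitespace-separated columns (word, pos, lemma) and the <story num="..."> markup of the example corpus in [2]. It walks the vertical text, remembers the num value of the enclosing story region, and appends it to every matching token.

```python
import re

# Example corpus from footnote [2], in vertical (one token per line) format.
VRT = """\
<!-- A Thrilling Experience -->
<story num="1" title="A Thrilling Experience">
<p>
<s>
Tick NN tick
. SENT .
</s>
<s>
A DT a
clock NN clock
. SENT .
</s>
<s>
Tick VB tick
, , ,
tick VB tick
. SENT .
</s>
</p>
</story>
<story num="2" title="A Thrilling Experience 2">
<p>
<s>
Tock NN tock
. SENT .
</s>
<s>
A DT a
click NN click
. SENT .
</s>
<s>
Tock VB tock
, , ,
tock VB tock
. SENT .
</s>
</p>
</story>
"""

# Hypothetical helper pattern: captures the num attribute of an opening story tag.
STORY_OPEN = re.compile(r'<story\s+num="([^"]*)"')


def matches_with_story_num(vrt, word):
    """Return 'cpos: word/pos/lemma/num' for every token whose word form equals `word`."""
    hits = []
    cpos = 0          # corpus position; only token lines are counted
    story_num = None  # num value of the story region currently open
    for line in vrt.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("<"):
            # Markup line: update the current story number, otherwise ignore it.
            opened = STORY_OPEN.match(line)
            if opened:
                story_num = opened.group(1)
            elif line == "</story>":
                story_num = None
            continue
        fields = line.split()
        if fields[0] == word:
            hits.append(f"{cpos}: {'/'.join(fields)}/{story_num}")
        cpos += 1
    return hits


for hit in matches_with_story_num(VRT, "A"):
    print(hit)
```

This only handles exact word-form matches on the raw input file, so it is no substitute for a real query; it merely shows the per-hit annotation being asked for.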
Regards,
Lukas
--
Dipl.-Ling. Lukas Michelbacher
Institute for Natural Language Processing
University of Stuttgart
phone: +49 (0)711-685-84587
fax : +49 (0)711-685-81366
email: michells at ims.uni-stuttgart.de
[1] My knowledge is based on http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CWBTutorial/cwb-tutorial.pdf
[2] This is my example corpus:
<!-- A Thrilling Experience -->
<story num="1" title="A Thrilling Experience">
<p>
<s>
Tick NN tick
. SENT .
</s>
<s>
A DT a
clock NN clock
. SENT .
</s>
<s>
Tick VB tick
, , ,
tick VB tick
. SENT .
</s>
</p>
</story>
<story num="2" title="A Thrilling Experience 2">
<p>
<s>
Tock NN tock
. SENT .
</s>
<s>
A DT a
click NN click
. SENT .
</s>
<s>
Tock VB tock
, , ,
tock VB tock
. SENT .
</s>
</p>
</story>
I encoded it with CWB-2.2.b99-RC1 using the following options:
-D -B -s -x -s -P pos -P lemma -S s:0 -S p:0 -V story:0+num+title