[CWB] question
Stefan Evert
stefan.evert at uos.de
Tue Mar 4 16:10:23 CET 2008
> The corpus has word, pos and lemma as positional attributes and <text>
> and <s> as structural attributes; is it possible to extract the mean
> length of <s> in the corpus, in number of words?
If you really just want to know the average sentence length in your
corpus, the answer is simply corpus size divided by number of
sentences (assuming that all text material in the corpus is enclosed
in <s> regions). You can get the necessary information from
cwb-describe-corpus, e.g.
$ cwb-describe-corpus -s BROWN
============================================================
Corpus: BROWN
============================================================
...
size (tokens): 1170811
...
s-ATT s 52108 regions (with annotations)
...
So there are 52108 sentences in the Brown corpus with a total of 1.17
million tokens, which gives you an average sentence length of 22.5
tokens (including punctuation).
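As a quick check of that arithmetic, here are the two figures from the output above plugged into a couple of lines of Python. This is only an illustration; the variable names are made up for the example.

```python
# Figures copied from the cwb-describe-corpus output for BROWN shown above
corpus_size = 1170811    # "size (tokens)"
num_sentences = 52108    # number of <s> regions

# Mean sentence length = corpus size / number of sentences
mean_length = corpus_size / num_sentences
print(f"{mean_length:.1f}")   # prints 22.5
```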
Of course, if you're interested in more sensible statistical
information - median, quantiles, histogram, etc. - you'll have to use
statistical software like R or write a Perl script yourself. Here's a
Unix command line that generates a list of sentence lengths (measured
in tokens including punctuation) and saves it to a file with one
sentence length per line.
$ cwb-s-decode BROWN -S s | awk '{print $2 - $1 + 1}' > brown_s_length.csv
You can easily process this file with Perl, or load it into R or
even Excel (which should recognise it as a CSV file).
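If you would rather not use R or Perl, a minimal sketch of the same kind of summary in Python might look like the following. It assumes a file named brown_s_length.csv with one integer per line, as produced by the command line above; the function names and the bin width of 5 tokens are arbitrary choices for the example, not part of CWB.

```python
import statistics
from collections import Counter
from pathlib import Path


def load_lengths(path):
    """Read one integer sentence length per line, skipping blank lines."""
    with open(path) as fh:
        return [int(line) for line in fh if line.strip()]


def summarise(lengths):
    """Return basic descriptive statistics (needs at least two values)."""
    q1, q2, q3 = statistics.quantiles(lengths, n=4)
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "quartiles": (q1, q2, q3),
        "max": max(lengths),
    }


def histogram(lengths, bin_width=5):
    """Count sentences per bin; each key is the lower edge of a bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    path = Path("brown_s_length.csv")   # assumed output file of the command above
    bin_width = 5
    if path.exists():
        lengths = load_lengths(path)
        print(summarise(lengths))
        for lower, count in histogram(lengths, bin_width).items():
            print(f"{lower:4d}-{lower + bin_width - 1:<4d} {count}")
```

Note that statistics.quantiles needs Python 3.8 or later, and different tools use slightly different quantile definitions, so the quartiles may not match R's defaults exactly.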
Best wishes,
Stefan