[CWB] question

Tue Mar 4 16:10:23 CET 2008

> The corpus has word, pos and lemma as positional attributes and <text>
> and <s> as structural attributes; is it possible to extract the mean
> length of <s> in the corpus, in number of words?

If you just want to know the average sentence length in your corpus,
the answer is simply the corpus size divided by the number of
sentences (assuming that all text material in the corpus is enclosed
in <s> regions). You can get both figures from cwb-describe-corpus,
e.g.

$ cwb-describe-corpus -s BROWN

============================================================
Corpus: BROWN
============================================================
...

size (tokens):  1170811

...

s-ATT s                     52108 regions (with annotations)

...

So there are 52108 sentences in the Brown corpus with a total of 1.17  
million tokens, which gives you an average sentence length of 22.5  
tokens (including punctuation).
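
As a quick check of that arithmetic, here is a minimal sketch that
divides the two figures shown in the output above. The function name
mean_len is made up for this illustration and is not part of CWB; the
numbers are copied by hand rather than parsed from cwb-describe-corpus.

```shell
# mean_len TOKENS SENTENCES: print the average sentence length,
# rounded to one decimal place. (Hypothetical helper, not a CWB tool.)
mean_len() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

# Figures copied from the cwb-describe-corpus output for BROWN
mean_len 1170811 52108
```

For the Brown corpus figures this prints 22.5.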

Of course, if you're interested in more informative statistics -
median, quantiles, histogram, etc. - you'll have to use statistical
software like R or write a Perl script yourself. Here's a Unix
command line that generates a list of sentence lengths (measured in
tokens including punctuation) and saves it to a file with one
sentence length per line.

$ cwb-s-decode BROWN -S s | awk '{print $2 - $1 + 1}' > brown_s_length.csv

You can easily process this file with Perl, or load it into R or
even Excel (which should recognise it as a CSV file).
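
If you'd rather stay on the command line, basic summary statistics can
also be computed with sort and awk. The following is only a rough
sketch: the function name summarise_lengths is invented for this
example, and it assumes the input has exactly one integer per line, as
in the file produced above.

```shell
# summarise_lengths: read one sentence length per line on stdin and
# print count, minimum, median, maximum and mean.
# (Hypothetical helper for illustration, not part of CWB.)
summarise_lengths() {
    sort -n | awk '
        { len[NR] = $1; sum += $1 }    # input is sorted, so len[] is too
        END {
            if (NR == 0) { print "no data"; exit }
            # median: middle value, or mean of the two middle values
            if (NR % 2) median = len[(NR + 1) / 2]
            else        median = (len[NR / 2] + len[NR / 2 + 1]) / 2
            printf "n=%d min=%d median=%g max=%d mean=%.2f\n",
                   NR, len[1], median, len[NR], sum / NR
        }'
}
```

You would call it as "summarise_lengths < brown_s_length.csv". For a
simple frequency table of sentence lengths (the raw data for a
histogram), "sort -n brown_s_length.csv | uniq -c" is enough.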

Best wishes,
Stefan