<div dir="ltr"><div dir="ltr">Thanks so much, Stephanie! It's great to have multiple solutions. </div><div dir="ltr"><br></div><div>All the best,</div><div>Scott</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 22, 2022 at 6:43 AM Stephanie Evert <<a href="mailto:stefanML@collocations.de">stefanML@collocations.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> I have a corpus which is divided into de facto subcorpora using an s-attribute, and I need to count the number of tokens in each subcorpus. Are there any issues with doing this by searching for [word=".+"] while selecting each of the s-attribute values and using the number of matches returned as the token count? Is there a better way to do this (ideally, one which would return all the match counts at once)?<br>
<br>
Let's assume that the s-attribute in question is &lt;div_cat&gt;, for the sake of exposition. There are three ways of obtaining the subcorpus sizes:<br>
<br>
1) The only efficient solution is to use cwb-s-decode together with a Perl, Python or R script for aggregating counts (or use available packages in one of those programming languages for direct corpus access).<br>
<br>
2) The lazy solution – if you don't care about wasting time and memory – works in CQP:<br>
<br>
Tokens = [];<br>
group Tokens match div_cat;<br>
<br>
(You'll probably want to run set PrettyPrint off; first and redirect the frequency table to a TSV file.)<br>
<br>
3) As a compromise, you can use cwb-scan-corpus on the command line. It is still relatively inefficient, but considerably faster than solution 2 and very memory-efficient.<br>
<br>
cwb-scan-corpus -o subcorpus_sizes.tsv CORPUS div_cat+0<br>
<br>
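For solution 1, a minimal aggregation sketch in Python might look like the following. It assumes that the regions have been dumped with something like cwb-s-decode CORPUS -S div_cat > regions.tsv, and that each output line has the form start, end, annotation value, separated by tabs. The function name subcorpus_sizes and the file name regions.tsv are made up for illustration.<br>
<br>

```python
import sys
from collections import Counter


def subcorpus_sizes(lines):
    """Sum token counts per annotation value.

    Assumes each line is "start<TAB>end<TAB>value", where start and end
    are inclusive corpus positions of one s-attribute region.
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        start, end, value = line.split("\t", 2)
        # A region covering positions start..end contains end - start + 1 tokens.
        sizes[value] += int(end) - int(start) + 1
    return sizes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 subcorpus_sizes.py regions.tsv
    with open(sys.argv[1], encoding="utf-8") as fh:
        for value, n_tokens in sorted(subcorpus_sizes(fh).items()):
            print(f"{value}\t{n_tokens}")
```

Since every region is read exactly once and only one counter per category is kept in memory, this should scale to large corpora. The file encoding is assumed to be UTF-8 and may need adjusting for your corpus.<br>
<br>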
Best,<br>
Stephanie<br>
<br>
</blockquote></div><br clear="all"><div><br></div></div>