<div dir="ltr"><div dir="ltr">Thanks so much, Stephanie! It&#39;s great to have multiple solutions. </div><div dir="ltr"><br></div><div>All the best,</div><div>Scott</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 22, 2022 at 6:43 AM Stephanie Evert &lt;<a href="mailto:stefanML@collocations.de">stefanML@collocations.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">&gt; I have a corpus which is divided into de facto subcorpora using an s-attribute, and I need to count the number of tokens in each subcorpus. Are there any issues with doing this by searching for [word=&quot;.+&quot;] while selecting each of the s-attribute values and using the number of matches returned as the token count? Is there a better way to do this (ideally, one which would return all the match counts at once)?<br>

<br>

Let&#39;s assume that the s-attribute in question is &lt;div_cat&gt;, for the sake of exposition.  There are three ways of obtaining the subcorpus sizes:<br>

<br>

1) The only efficient solution is to use cwb-s-decode together with a Perl, Python or R script for aggregating counts (or use available packages in one of those programming languages for direct corpus access).<br>

<br>

2) The lazy solution – if you don&#39;t care about wasting time and memory – works in CQP:<br>

<br>

        Tokens = [];<br>

        group Tokens match div_cat;<br>

<br>

(You&#39;ll probably want to run &quot;set PrettyPrint off;&quot; first and redirect the frequency table to a TSV file.)<br>

<br>

3) As a compromise, you can use cwb-scan-corpus on the command line. It is still relatively inefficient, but considerably faster than solution 2 and very memory-efficient.<br>

<br>

        cwb-scan-corpus -o subcorpus_sizes.tsv CORPUS div_cat+0<br>

<br>
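For solution 1, a minimal Python sketch of the aggregation step might look like the following. It assumes that cwb-s-decode prints one region per line as start, end and annotation value separated by tabs, with inclusive corpus positions, and that the registry is found via the usual default or the CORPUS_REGISTRY environment variable. The function names region_sizes and subcorpus_sizes are made up for illustration.<br>

<br>

```python
import subprocess
from collections import Counter


def region_sizes(lines):
    """Sum token counts per annotation value.

    Each line is assumed to look like "start<TAB>end<TAB>value",
    where start and end are inclusive corpus positions.
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split at most twice so that values containing tabs stay intact.
        start, end, value = line.split("\t", 2)
        sizes[value] += int(end) - int(start) + 1
    return sizes


def subcorpus_sizes(corpus, attribute="div_cat"):
    """Run cwb-s-decode (assumed to be on PATH) and aggregate its output."""
    proc = subprocess.run(
        ["cwb-s-decode", corpus, "-S", attribute],
        capture_output=True,
        text=True,
        check=True,
    )
    return region_sizes(proc.stdout.splitlines())


if __name__ == "__main__":
    # CORPUS is a placeholder for the registry name of your corpus.
    for value, n_tokens in sorted(subcorpus_sizes("CORPUS").items()):
        print(f"{value}\t{n_tokens}")
```

<br>

Because only one line per region is read rather than one per token, this should scale with the number of regions rather than with corpus size.<br>

<br>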

Best,<br>

Stephanie<br>

<br>

</blockquote></div><br clear="all"><div><br></div></div>