<div dir="ltr">Thanks again, Stephanie!</div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Fri, Jun 27, 2025 at 8:46 a.m., Stephanie Evert (<a href="mailto:stefanML@collocations.de">stefanML@collocations.de</a>) wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><blockquote type="cite"><div><span style="font-family:ArialMT;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">Thank you, Stephanie, this is what I was looking for. For the regex, I guess I can do something like text_id = "(ID1|ID2|IDn)"</span></div></blockquote><div><br></div><div>You can also read them as a word list and compile the regexp automatically with the RE() operator:</div><div><br></div><div><span style="white-space:pre-wrap"> </span>define $texts < "text_ids.txt";</div><div><span style="white-space:pre-wrap"> </span>Texts = <text_id = RE($texts)> [] expand to text;</div><div><br></div><div>But it has the same limits, namely …</div><div><br></div><blockquote type="cite"><div><div style="font-family:ArialMT;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">On the other hand, when you said “so this can be tedious (or not work at all) if you have a very long list of text IDs”, what exactly could fail? If I have, say, 100 docs, could this approach fail?</div></div></blockquote><br></div><div>There is a length limit for strings in CWB, which we've been increasing over the years. Worse, regexp implementations often have their own limits, and exceeding them does *not* throw an error: the rest of the regexp is silently ignored. 
So you might not be matching all IDs without ever noticing. It's probably a good idea to check with</div><div><br></div><div><span style="white-space:pre-wrap"> </span>tabulate Texts match text_id > "test.txt";</div><div><br></div><div>and compare test.txt with text_ids.txt.</div><div><br></div><div>Best,</div><div>Stephanie</div><div><br></div><div>PS: Some people (esp. Python users) try to optimise the regexp by combining shared prefixes (there's a Python package for doing this automatically). This is even worse, because PCRE1 (which the released version of CWB still uses) doesn't support deeply nested parentheses and will just silently discard them.</div><div><br></div><br></div>_______________________________________________<br>
CWB mailing list<br>
<a href="mailto:CWB@sslmit.unibo.it" target="_blank">CWB@sslmit.unibo.it</a><br>
<a href="http://liste.sslmit.unibo.it/mailman/listinfo/cwb" rel="noreferrer" target="_blank">http://liste.sslmit.unibo.it/mailman/listinfo/cwb</a><br>
</blockquote></div>
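<div dir="ltr"><br>For the comparison of test.txt with text_ids.txt suggested above, here is a minimal Python sketch. It assumes both files hold one text ID per line (which is what the tabulate command should produce when each match is a single text region). The function names read_ids and compare_ids are just illustrative, not part of CWB.</div>

```python
def read_ids(path):
    """Return the set of non-empty, whitespace-stripped lines in a file."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def compare_ids(wanted_path, matched_path):
    """Compare the requested IDs with the IDs that CQP actually matched.

    Returns two sorted lists:
      missing    - IDs in the word list that never showed up in the output
      unexpected - IDs in the output that were not in the word list
    """
    wanted = read_ids(wanted_path)    # assumed: one ID per line (text_ids.txt)
    matched = read_ids(matched_path)  # assumed: one ID per line (test.txt)
    return sorted(wanted - matched), sorted(matched - wanted)
```

<div dir="ltr">Calling compare_ids("text_ids.txt", "test.txt") returns the two lists. A non-empty first list would suggest that part of the regexp was silently dropped, or simply that those IDs do not occur in the corpus, so it is worth checking a few of them by hand.</div>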