[CWB] Too many tokens in an attribute?

Stefan Evert stefanML at collocations.de
Sun Apr 30 10:35:52 CEST 2017


Two short comments:

1) The hard boundary setting can actually be overridden with an explicit within clause (which people should get much more in the habit of using, e.g. to avoid unintended matches across sentence boundaries), so

	<msg>[]*</msg>  :: match.msg_msg_id = "162578" within msg

works.  IIRC, there used to be a "hard" hard limit buried somewhere in the source code, which could not be exceeded by a user specification.  We seem to have removed this restriction at some point.

2) If you want to return a complete XML region, the right way to do this is

	<msg> [] :: match.msg_msg_id = "162578" expand to msg

This will be much more efficient for a large corpus and is not affected by hard boundary settings (because "expand" is applied at a later stage).  The approach generalizes to situations where you're looking for a <msg> region containing a more complex search pattern:

	… pattern … within msg expand to msg;

where the pattern may include a global constraint (like :: match.msg_msg_id = "162578").  The within clause is crucial so that (i) you don't get a hard boundary on the pattern and (ii) each match is guaranteed to be a single <msg> region.
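
As a made-up illustration (assuming only the default word attribute; the two word forms are arbitrary and not taken from your corpus), messages in which "hello" is followed later in the same message by "bye" could be retrieved with

	[word = "hello"] []* [word = "bye"] within msg expand to msg;

Each match should then be one complete <msg> region, however far apart the two words are.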

Best,
Stefan


> On 29 Apr 2017, at 08:49, Simone Ueberwasser <simone at ueberwasser.net> wrote:
> 
> In this example, the first message contains 3 tokens. If I run the following query, the message is found:
> <msg>[]*</msg>:: match.msg_msg_id = "162577"
> 
> The second message is 95 tokens long. The same query shows no results:
> <msg>[]*</msg>:: match.msg_msg_id = "162578"
> 
> 


