[CWB] Alternative and patterns order in queries

Tue Apr 7 15:29:11 CEST 2015

> Do you know why these 2 queries don't return the same tokens ?
> 
> <text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;
> 
> </q>[!q]+</text> | <text>[!q]+<q> | </q>[!q]+<q>;
> 
> The first query doesn't work as expected, the returned tokens match only the first alternative pattern part : <text>[!q]+<q>
> as if the pipe character would act like the OR boolean condition instead of the REGEX alternative.

I can't fully reproduce this problem: on my machine, only the third alternative pattern is ignored, the first two are matched.  Are you sure that your system behaves differently?

What's happening here is that CQP tries to match opening and closing XML tags and ensures that you don't skip region boundaries in between, so

	<q> []+ </q>

is guaranteed to stay within a single <q> region.  Because this amount of XML support wasn't envisaged in the original implementation of CQP, we use a simple trick that isn't aware of the "|" disjunctions.  As soon as there is an open tag followed by a corresponding close tag _somewhere_ in the query, _all_ open and close tags for this s-attribute are expected to match up.

CQP thus expects that in the query

	<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;

the final </text> is the closing tag corresponding to <text> at the start of the query, but this can never be satisfied because the two tags are in different branches of the query, so the third branch never matches.

You might expect the same to happen with the initial </q> in the second (and third) branch, but such positions are matched with a different method that does not carry out the check.
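
To see the effect of the tag matching in isolation, something like the following should do.  This is an untested sketch: it assumes a corpus with non-nested <q> regions and that StrictRegions is on by default, and the results are still subject to the HardBoundary limit.

	set ms longest;
	# tag matching enabled: each match should stay inside a single <q> region
	set StrictRegions on;
	<q> []+ </q>;
	# tag matching disabled: a match may now run from the start of one
	# <q> region to the end of a later one
	set StrictRegions off;
	<q> []+ </q>;
	# restore the assumed defaults
	set StrictRegions on;
	set ms standard;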

There are several different work-arounds:

1) Replace the final </text> by an rbound() constraint

	<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]* [!q & rbound(text)];

This is rather unsafe, since you need to be aware of precisely which closing tags will be checked.

2) Disable matching of open and close tags for this query, which is probably the best and fastest solution.

	set StrictRegions off;
	<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;

3) Run three separate queries and use set operations to combine the results.

	A1 = <text>[!q]+<q>;
	A2 = </q>[!q]+<q>;
	A3 = </q>[!q]+</text>;
	A = union A1 A2;
	A = union A A3;
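
If you go for work-around 2, keep in mind that StrictRegions is a global option and stays off until you switch it back.  A minimal sketch, assuming the option was on before (the name B is arbitrary):

	set StrictRegions off;
	B = <text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;
	# re-enable tag matching for subsequent queries
	set StrictRegions on;
	# compare with the result of work-around 3
	size B;
	size A;

I would expect the two sizes to be similar, but I haven't checked this; they need not be identical if two branches can match from the same start position.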

> The second query seems to work as expected and returns all the tokens outside the "q" tag.

Of course, a simple (if rather slow) solution would be to use the longest match strategy and simply search for sequences of tokens outside <q> regions:

	set ms longest;
	[!q]+ within text;

You can make this more efficient by specifying the relevant start points:

	(<text> | </q>) [!q]+ within text;

Two other remarks:

1) You should probably add "within text" to your queries.  The second alternative might cross a </text> boundary otherwise!

2) Keep in mind that there's a global limit on the length of a match set by the HardBoundary option.  You may need to increase the default setting of 100 tokens.
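
Combined with work-around 2, these two remarks might look as follows.  This is only a sketch: the value 500 is an arbitrary example, so pick whatever covers the longest stretch of text you expect between <q> regions.

	# raise the global limit on match length (default: 100 tokens)
	set HardBoundary 500;
	set StrictRegions off;
	(<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>) within text;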

Best,
Stefan

PS: Anybody feel like adding an explanation of the problem to the CQP query tutorial or an FAQ?