[CWB] Alternative and pattern order in queries

Sébastien Jacquot sebastien.jacquot at univ-fcomte.fr
Wed Apr 8 12:16:19 CEST 2015


On 07/04/2015 15:29, Stefan Evert wrote:
>> Do you know why these 2 queries don't return the same tokens?
>>
>> <text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;
>>
>> </q>[!q]+</text> | <text>[!q]+<q> | </q>[!q]+<q>;
>>
>> The first query doesn't work as expected: the returned tokens match only the first alternative, <text>[!q]+<q>,
>> as if the pipe character acted like a Boolean OR condition instead of a regex alternation.
> I can't fully reproduce this problem: on my machine, only the third alternative pattern is ignored, the first two are matched.  Are you sure that your system behaves differently?
Hi,
Thank you very much for this information.
Yes, sorry: the 2nd alternative does in fact return the expected 
tokens, so only the third alternative is ignored.
>
> What's happening here is that CQP tries to match opening and closing XML tags and ensures that you don't skip region boundaries in between, so
>
> 	<q> []+ </q>
>
> is guaranteed to stay within a single <q> region.  Because this amount of XML support wasn't envisaged in the original implementation of CQP, we use a simple trick that isn't aware of the "|" disjunctions.  As soon as there is an open tag followed by a corresponding close tag _somewhere_ in the query, _all_ open and close tags for this s-attribute are expected to match up.
>
> CQP thus expects that in the query
>
> 	<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;
>
> the final </text> is the closing tag corresponding to <text> at the start of the query, but this can never be satisfied because they are in two different branches of the query, so the third branch never matches.
>
> You might expect the same to happen with the initial </q> in the second (and third) branch, but such positions are matched with a different method that does not carry out the check.
OK, I understand now.
>
> There are several different work-arounds:
>
> 1) Replace the final </text> by an rbound() constraint
>
> 	<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]* [!q & rbound(text)];
>
> This is rather unsafe, since you need to be aware of precisely which closing tags will be checked.
>
> 2) Disable matching of open and close tags for this query, which is probably the best and fastest solution.
>
> 	set StrictRegions off;
> 	<text>[!q]+<q> | </q>[!q]+<q> | </q>[!q]+</text>;
>
> 3) Run three separate queries and use set operations to combine the results.
>
> 	A1 = <text>[!q]+<q>;
> 	A2 = </q>[!q]+<q>;
> 	A3 = </q>[!q]+</text>;
> 	A = union A1 A2;
> 	A = union A A3;
>
The 1st and 2nd workarounds work well on the tested corpus.
The 3rd doesn't seem to work; I haven't identified why yet. For now my 
only test is to check that the number of tokens inside "q" plus the 
number of tokens outside "q" equals the number of tokens in the root 
corpus, so I need to investigate further.
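
A sum-of-sizes test like the one above can pass even when some tokens are
counted twice and others are missed. As a minimal sketch of a stricter
check, written in Python rather than CQP: it assumes the match ranges of
each query have already been exported as inclusive (start, end) pairs of
corpus positions. The function name and the toy data are hypothetical,
and how the ranges are exported from CQP is left open.

```python
from collections import Counter

def coverage_report(corpus_size, *range_sets):
    """Return (missing, duplicated) corpus positions for the given range sets.

    Each range set is a list of (start, end) pairs, inclusive on both ends.
    """
    hits = Counter()
    for ranges in range_sets:
        for start, end in ranges:
            for pos in range(start, end + 1):
                hits[pos] += 1
    # Positions no query matched, and positions matched more than once.
    missing = [p for p in range(corpus_size) if hits[p] == 0]
    duplicated = sorted(p for p, n in hits.items() if n > 1)
    return missing, duplicated

# Hypothetical toy data: a 10-token corpus with tokens 3-5 inside <q>.
outside = [(0, 2), (6, 9)]
inside = [(3, 5)]
missing, duplicated = coverage_report(10, outside, inside)
print(missing, duplicated)
```

Two empty lists mean that the two result sets partition the corpus. The
sketch keeps one counter per token, so it is only suitable for small test
corpora.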

>> The second query seems to work as expected and returns all the tokens outside the "q" tag.
> Of course, a simple (if rather slow) solution would be to use the longest match strategy and simply search for sequences of tokens outside <q> regions:
>
> 	set ms longest;
> 	[!q]+ within text;
>
> You can make this more efficient by specifying relevant start points
>
> 	(<text> | </q>) [!q]+ within text;
Indeed, both queries work on the tested corpus. The first can take a 
long time to run, but it is much simpler than what I tried; I hadn't 
thought of it.
>
> Two other remarks:
>
> 1) You should probably add "within text" to your queries.  The second alternative might cross a </text> boundary otherwise!
>
> 2) Keep in mind that there's a global limit on the length of a match set by the HardBoundary option.  You may need to increase the default setting of 100 tokens.
Thanks again for all this information and advice.
About point 1 just above, and the other query:

</q>[!q]+</text> | <text>[!q]+<q> | </q>[!q]+<q>;


What do you think of this query? Could it give "wrong" results in some 
cases?
Besides that, does it avoid crossing the </text> boundary? (I guess so.)

I have one more question; I didn't find anything about it in the 
CQPTutorial.pdf (November 2009) file on SourceForge. Is there any way 
to perform arithmetic operations in CQP and display the result? The 
purpose here would be to compute:
size rootCorporaAllTokens - (size subcorpora1Tokens + size 
subcorpora2Tokens) which should be equal to 0 (given the structure 
and queries in this discussion).
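
In case the arithmetic has to be done outside CQP, here is a minimal
Python sketch of the computation above. It assumes each query result has
been dumped as text with one match per line, giving the start and end
corpus positions separated by whitespace (for example something like the
output of tabulating the match and matchend anchors; the exact export
step is an assumption). As far as I know, CQP's size command reports the
number of matches rather than the number of tokens, so the sketch sums
the range lengths instead. The function names and data are hypothetical.

```python
def token_count(tabulated):
    """Sum the lengths of inclusive 'start end' ranges, one range per line."""
    total = 0
    for line in tabulated.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end = (int(x) for x in line.split())
        total += end - start + 1
    return total

def partition_gap(corpus_size, *tabulated_results):
    """Corpus size minus the tokens covered by all results; 0 is expected."""
    return corpus_size - sum(token_count(t) for t in tabulated_results)

# Hypothetical toy data: a 10-token corpus with tokens 3-5 inside <q>.
outside = "0\t2\n6\t9\n"
inside = "3\t5\n"
print(partition_gap(10, outside, inside))
```

A result of 0 only shows that the totals agree; it does not rule out
overlapping matches compensating for missed tokens.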

Sebastian

>
> Best,
> Stefan
>
>
> PS: Anybody feel like adding an explanation of the problem to the CQP query tutorial or an FAQ?
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>

-- 
ELLIADD, EA 4661
UFR SLHS - Université de Franche-Comté
30-32 rue Mégevand
25030 Besançon cedex
03.81.66.54.22


