[CWB] [ cwb-Bugs-3514300 ] Numbered backrefs in string-level regex not working

SourceForge.net noreply at sourceforge.net
Tue May 1 14:33:00 CEST 2012


Bugs item #3514300, was opened at 2012-04-02 16:53
Message generated for change (Comment added) made by schtepf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3514300&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
Group: TODO-3.5
>Status: Closed
>Resolution: Fixed
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: Numbered backrefs in string-level regex not working

Initial Comment:
Only one of the various possible syntaxes for backrefs within a regex seems to be working.

For instance:

"(.)\1"

should in theory find all forms consisting of the same character twice. However, it's not working. Neither are the following, which according to man pcre should be equivalent:

"(.)\g1"
"(.)\g{1}"

The following, however, DOES work, even though it SHOULD be identical to the preceding:

"(?P<name>.)\g{name}"

I suspect this is due to the regex optimiser and its lack of full PCRE-awareness (in cl/regopt.c) -- i.e. it is doing an incorrect optimisation and doing simple string matching on the first three but not on the fourth -- but cannot be sure without further investigation.

----------------------------------------------------------------------

>Comment By: Stefan Evert (schtepf)
Date: 2012-05-01 05:33

Message:
Fixed in trunk in revision #313 (as suggested in comments).

----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2012-04-03 01:37

Message:
Hmm, that would explain it! 

I will change the wrap to a non-capturing bracket as soon as I get a
chance.

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2012-04-03 01:30

Message:
The CL internally rewrites the entered regexp <r> into ^(<r>)$ to enforce
the anchoring. This was implemented in order to use standard regexp
libraries; previously, the CWB included a specially hacked regexp
implementation able to enforce anchoring.

Solution for this: change the rewrite to ^(?:<r>)$ (or whatever the correct
PCRE syntax for non-capturing parentheses is).

The possibility of regexp optimiser problems should be investigated,
though. I tried to be very conservative, but I'm not sure how escape
sequences like \g are handled.
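
The group-number shift caused by the ^(<r>)$ rewrite, and the effect of the
proposed ^(?:<r>)$ rewrite, can be sketched as follows. This is only an
illustration under stated assumptions: it uses Python's re module rather than
PCRE or the CL code, the helper names (wrap_capturing, wrap_non_capturing,
full_match) are hypothetical, and the named backreference is written as
(?P=name) because Python's re does not accept \g{name} inside a pattern. The
capturing-wrapper case with \1 is left out because Python's re refuses to
compile a reference to a group that is still open.

```python
import re


def wrap_capturing(user_regex):
    # The rewrite described above: anchors plus a capturing group.
    return "^(" + user_regex + ")$"


def wrap_non_capturing(user_regex):
    # The proposed fix: anchors plus a non-capturing group.
    return "^(?:" + user_regex + ")$"


def full_match(wrapped, token):
    # True if the wrapped (anchored) regex matches the whole token.
    return re.search(wrapped, token) is not None


NAMED = r"(?P<name>.)(?P=name)"

# With the capturing wrapper, the wrapper itself is group 1,
# so the user's first group is renumbered to 2.
user_group_old = re.compile(wrap_capturing(NAMED)).groupindex["name"]

# With the non-capturing wrapper, the user's first group stays group 1.
user_group_new = re.compile(wrap_non_capturing(NAMED)).groupindex["name"]

# "(.)\2" behaves like the intended "(.)\1" under the capturing wrapper.
old_backref2 = full_match(wrap_capturing(r"(.)\2"), "aa")

# Named backreferences are resolved by name, so they survive the shift.
old_named = full_match(wrap_capturing(NAMED), "aa")

# With the non-capturing wrapper, "(.)\1" works as the user wrote it.
new_backref1 = full_match(wrap_non_capturing(r"(.)\1"), "aa")
new_rejects = full_match(wrap_non_capturing(r"(.)\1"), "ab")
```

This is consistent with the two observations in this thread: "(.)\2" matching
where "(.)\1" was expected, and the named form working unchanged.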

----------------------------------------------------------------------

Comment By: Serge Heiden (sheiden)
Date: 2012-04-03 01:25

Message:
"(.)\2" actually does what "(.)\1" should do.
There is apparently a +1 shift in the RE group buffers in PCRE
[TXM 0.6b2, CQP 3.4, Linux]

----------------------------------------------------------------------
