[CWB] [ cwb-Bugs-3514300 ] Numbered backrefs in string-level regex not working
SourceForge.net
noreply at sourceforge.net
Tue May 1 14:33:00 CEST 2012
Bugs item #3514300, was opened at 2012-04-02 16:53
Message generated for change (Comment added) made by schtepf
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3514300&group_id=131809
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
Group: TODO-3.5
>Status: Closed
>Resolution: Fixed
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: Numbered backrefs in string-level regex not working
Initial Comment:
Only one of the various possible syntaxes for backrefs within a regex seems to be working.
For instance:
"(.)\1"
should in theory find all forms consisting of the same character twice. However, it does not work. Neither do the following, which according to man pcre should be equivalent:
"(.)\g1"
"(.)\g{1}"
The following, however, DOES work, even though it SHOULD be identical to the preceding:
"(?P<name>.)\g{name}"
I suspect this is due to the regex optimiser and its lack of full PCRE-awareness (in cl/regopt.c) -- i.e. it is doing an incorrect optimisation and doing simple string matching on the first three but not on the fourth -- but cannot be sure without further investigation.
----------------------------------------------------------------------
>Comment By: Stefan Evert (schtepf)
Date: 2012-05-01 05:33
Message:
Fixed in trunk in revision #313 (as suggested in comments).
----------------------------------------------------------------------
Comment By: Andrew Hardie (andrewhardie)
Date: 2012-04-03 01:37
Message:
Hmm, that would explain it!
I will change the wrap to a non-capturing bracket as soon as I get a
chance.
----------------------------------------------------------------------
Comment By: Stefan Evert (schtepf)
Date: 2012-04-03 01:30
Message:
The CL internally rewrites the entered regexp <r> into ^(<r>)$ to enforce
the anchoring. This was implemented in order to use standard regexp
libraries; previously, the CWB included a specially hacked regexp
implementation able to enforce anchoring.
Solution for this: change the rewrite to ^(?:<r>)$ (or whatever the correct
PCRE syntax for non-capturing parentheses is).
The possibility of regexp optimiser problems should be investigated,
though. I tried to be very conservative, but I'm not sure how escape
sequences like \g are handled.
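
The effect of the two rewrites on group numbering can be sketched in Python,
whose re module numbers capturing groups the same way PCRE does for these
constructs. This is only an illustration and not the CL code: anchor() is a
hypothetical helper, and the real rewrite happens inside the C library.

```python
import re

def anchor(regexp, capturing):
    """Wrap a user regexp so that it has to match the whole string.

    Hypothetical helper for illustration; not the CL implementation.
    """
    opener = "(" if capturing else "(?:"
    return "^" + opener + regexp + ")$"

user_regexp = r"(.)\1"  # intended: the same character twice

# Capturing wrap: the wrapper becomes group 1, the user's group becomes 2.
print(re.compile(anchor(r"(.)", True)).groups)   # 2

# Non-capturing wrap: the user's group keeps number 1, so \1 works.
fixed = re.compile(anchor(user_regexp, False))
print(fixed.groups)                       # 1
print(fixed.search("aa") is not None)     # True
print(fixed.search("ab") is not None)     # False
```

With the non-capturing wrap the anchoring is still enforced, but the numbers
the user typed refer to the user's own groups.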
----------------------------------------------------------------------
Comment By: Serge Heiden (sheiden)
Date: 2012-04-03 01:25
Message:
"(.)\2" does what "(.)\1" should do actually.
There is apparently a +1 shift in the RE group buffers in PCRE.
[TXM 0.6b2, CQP 3.4, Linux]
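
The reported shift can be reproduced with a short Python sketch, assuming the
regexp is wrapped as ^(<r>)$ before it is compiled; wrapped() is an
illustrative name, not a CWB function. One difference to keep in mind:
Python's re refuses a backreference to a still-open group at compile time,
whereas PCRE, according to its documentation, compiles such a pattern and the
backreference then fails to match.

```python
import re

def wrapped(regexp):
    # Assumed capturing wrap: adds an outer group that takes number 1.
    return "^(" + regexp + ")$"

# With the wrap, the user's first group is number 2, so \2 finds doubles.
shifted = re.compile(wrapped(r"(.)\2"))
print(shifted.search("aa") is not None)   # True
print(shifted.search("ab") is not None)   # False

# \1 now points at the wrapper itself, which is still open at that point.
try:
    re.compile(wrapped(r"(.)\1"))
    outcome = "compiled"
except re.error:
    outcome = "rejected"
print(outcome)   # rejected
```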
----------------------------------------------------------------------