[CWB] [ cwb-Bugs-3514300 ] Numbered backrefs in string-level regex not working

SourceForge.net noreply at sourceforge.net
Tue Apr 3 01:53:56 CEST 2012


Bugs item #3514300, was opened at 2012-04-02 16:53
Message generated for change (Tracker Item Submitted) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3514300&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
Group: TODO-3.5
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: Numbered backrefs in string-level regex not working

Initial Comment:
Only one of the various possible syntaxes for backrefs within a regex seems to be working.

For instance:

"(.)\1"

should in theory find all forms consisting of the same character twice. However, it does not work. Neither do the following, which according to man pcre should be equivalent:

"(.)\g1"
"(.)\g{1}"

The following, however, DOES work, even though it SHOULD be identical to the preceding:

"(?P<name>.)\g{name}"

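For reference, the intended semantics can be checked outside CWB. The sketch below uses Python's re module as a stand-in, not PCRE itself: re accepts \1 and (?P=name) but rejects \g1 and \g{1} in patterns, so only the numbered form and Python's spelling of the named form are shown. The word list is made up for illustration, and fullmatch is used on the assumption that the pattern has to match the whole form.

```python
import re

# Made-up sample forms, for illustration only.
words = ["aa", "ab", "oo", "11", "a", "aaa"]

# Numbered backreference: one character, then the same character again.
numbered = re.compile(r"(.)\1")

# Named backreference in Python's syntax; PCRE would also accept \g{name}.
named = re.compile(r"(?P<name>.)(?P=name)")

# fullmatch requires the pattern to cover the entire form.
hits_numbered = [w for w in words if numbered.fullmatch(w)]
hits_named = [w for w in words if named.fullmatch(w)]

print(hits_numbered)
print(hits_named)
```

Both lists should be identical, which is the behaviour expected from the four CQP patterns above.
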
I suspect this is due to the regex optimiser (in cl/regopt.c) and its lack of full PCRE-awareness, i.e. it is applying an incorrect optimisation that reduces the first three patterns to simple string matching but leaves the fourth alone. I cannot be sure without further investigation, however.
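
To make the suspected failure mode concrete, here is a purely hypothetical sketch of how a prefilter that is unaware of backreferences could break "(.)\1". It is not the code in cl/regopt.c; the function names and the required-characters heuristic are assumptions invented for illustration, and the sketch does not try to model why the named form escapes the problem.

```python
import re

def naive_required_chars(pattern):
    """Hypothetical, deliberately flawed prefilter.

    Returns the characters it believes every match must contain. It treats
    any backslash escape as "the next character, taken literally", so the
    backreference \\1 is wrongly recorded as the literal character '1'.
    """
    required = set()
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            required.add(pattern[i + 1])  # BUG: \1 is not the literal '1'
            i += 2
        elif c in "().":
            i += 1  # metacharacters contribute no required literal
        else:
            required.add(c)
            i += 1
    return required

def prefiltered_match(pattern, word):
    """Run the flawed prefilter first, then the real regex engine."""
    if not naive_required_chars(pattern) <= set(word):
        return False  # rejected before the regex is ever tried
    return re.fullmatch(pattern, word) is not None

print(prefiltered_match(r"(.)\1", "aa"))
print(prefiltered_match(r"(.)\1", "11"))
print(re.fullmatch(r"(.)\1", "aa") is not None)
```

Under these assumptions the form "aa" is discarded by the prefilter even though the regex engine alone would match it, while "11" survives only because it happens to contain the character 1.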

----------------------------------------------------------------------
