[CWB] [ cwb-Bugs-3058703 ] Regex optimisation does not respect PCRE esc seqs

SourceForge.net noreply at sourceforge.net
Fri Sep 3 12:40:26 CEST 2010


Bugs item #3058703, was opened at 2010-09-03 10:40
Message generated for change (Tracker Item Submitted) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3058703&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CL low-level library
Group: None
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Stefan Evert (schtepf)
Summary: Regex optimisation does not respect PCRE esc seqs

Initial Comment:
Regex optimisation does not respect PCRE esc seqs

PCRE has many more escape sequences than POSIX regex (including Unicode properties), some of which consist of more than one character after the backslash.

For instance, each of the following matches a single character (either one literal character or a character-class wildcard):

\p{M}
\p{Lu}
\P{Pf}
\x{1a024}

and, even worse, single-letter Unicode properties have a legal abbreviation without braces, e.g.

\pL

To quote man pcre:
---------
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter. In this case, in
the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:

         \p{L}
         \pL
---------
(Something like /\pLust/ is not ambiguous: only one-letter properties may be abbreviated, never two-letter ones, so it parses as \pL followed by the literal "ust", not as \p{Lu} followed by "st".)

The CL regex optimiser scans a regex for grains on the assumption that a backslash affects only the single character that follows it. This assumption is clearly wrong for PCRE, at least in UTF8 mode: the tail of a multi-character escape sequence can be mistaken for literal text, so grains may be detected that are not real grains, and candidate strings may then be falsely rejected.
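
To illustrate the failure mode, here is a rough sketch in Python. It is not the CL optimiser's code. The names escape_length and literal_runs are hypothetical, the "grain" notion is reduced to maximal runs of alphanumeric literals, and only the escape forms listed in this report (\p{..}, \P{..}, \x{..}, \pL) are recognised; every other escape is assumed to be two characters long, which is itself a simplification.

```python
import re

# Braced escapes mentioned in this report: \p{..}, \P{..}, \x{hex}.
# (Assumption: other multi-character PCRE escapes are ignored in this sketch.)
_BRACED = re.compile(r'\\(?:[pP]\{\^?[A-Za-z_&]+\}|x\{[0-9A-Fa-f]+\})')


def escape_length(pattern, i):
    """Length of the escape sequence that starts at pattern[i] (a backslash)."""
    m = _BRACED.match(pattern, i)
    if m:
        return m.end() - i
    # Abbreviated one-letter property: \pL or \PL
    if pattern[i + 1:i + 2] in ("p", "P") and i + 2 < len(pattern):
        return 3
    # Fallback: backslash plus one character (or a trailing lone backslash)
    return 2 if i + 1 < len(pattern) else 1


def literal_runs(pattern, pcre_aware):
    """Toy stand-in for grain detection: collect runs of alphanumeric literals.

    With pcre_aware=False a backslash is assumed to cover exactly one
    following character, which is the assumption described in this report.
    """
    runs, current, i = [], "", 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            step = escape_length(pattern, i) if pcre_aware else 2
        elif ch.isalnum():
            current += ch
            i += 1
            continue
        else:
            step = 1
        # Any escape or metacharacter ends the current literal run.
        if current:
            runs.append(current)
            current = ""
        i += step
    if current:
        runs.append(current)
    return runs


for pat in (r"\p{Lu}nder", r"\pLust", r"\x{1a024}abc"):
    print(pat, literal_runs(pat, pcre_aware=False), literal_runs(pat, pcre_aware=True))
```

With the one-character assumption, \p{Lu}nder yields the bogus run "Lu" and \pLust yields "Lust" instead of "ust". A string such as "Must" matches /\pLust/ but does not contain "Lust", so a prefilter based on that run would reject it wrongly.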

For current users: this bug won't kick in if you stick to basic regex syntax!

----------------------------------------------------------------------
