Skip to content Skip to sidebar Skip to footer

Find Capturing Groups In User Submitted Regex

I have a python application which needs to process user submitted regular expressions. Because of performance considerations i want to forbid capturing groups and back references.

Solution 1:

Regular expression language is not a regular language so it cannot be reliably split into meaningful parts by a regex (see RegEx match open tags except XHTML self-contained tags for the same case for HTML).

Why not use Python's own parser instead to do this?

>>> r="whate(ever)(?:\\1)"
>>> import sre_parse        #the module used by `re' internally for regex parsing
>>> sre_parse.parse(r)
[('literal', 119), ('literal', 104), ('literal', 97), ('literal', 116),
 ('literal', 101), ('subpattern', (1, [('literal', 101), ('literal', 118), ('lit
eral', 101), ('literal', 114)])), ('subpattern', (None, [('groupref', 1)]))]

As you can see, this is a parse tree, and you're interested in subpattern nodes with non-None in the first element and groupref's.


Solution 2:

Your regex is very naive. (In fact, I had a hard time finding a group construct that is matched by your regex.)

To avoid false positives, like for example [(bar)], it's necessary to parse/match the entire pattern from left to right. I've come up with this regex:

^(?:\[(?:\\.|[^\]])*\]|\(\?:|[^(\\]|\\\D|\(\?[gmixsu])*$

regex101 demo.


Explanation:

^   # start of string anchor
(?:  # this group matches a single valid expression:
# character classes:
    \[  # the opening [
    (?: 
        \\.  # any escaped character
    | # or 
        [^\]] # anything that's not a closing ]
    )* # any number of times
    \]  # the closing ]
|
# non-capturing groups:
    \(\?:  # (?: literally
|
# normal characters (anything that's not a backslash or (
    [^(\\]
|
# meta sequences like \s
    \\\D
|
# inline modifiers like (?i)
    \(\?[gmixsu]
)* # any number of valid expressions.
$  # end of string anchor

P.S.: This regex does not guarantee that the pattern is valid. (Compiling the pattern can still fail.)


Post a Comment for "Find Capturing Groups In User Submitted Regex"