Find Capturing Groups In User Submitted Regex
Solution 1:
The regular expression language is not itself a regular language (groups can nest to arbitrary depth), so a regex cannot reliably split it into meaningful parts. See "RegEx match open tags except XHTML self-contained tags" for the same problem with HTML.
Why not use Python's own parser instead to do this?
>>> r = "whate(ever)(?:\\1)"
>>> import sre_parse  # the module used by `re` internally for regex parsing
>>> sre_parse.parse(r)
[('literal', 119), ('literal', 104), ('literal', 97), ('literal', 116),
 ('literal', 101), ('subpattern', (1, [('literal', 101), ('literal', 118),
 ('literal', 101), ('literal', 114)])), ('subpattern', (None, [('groupref', 1)]))]
As you can see, this is a parse tree. You're interested in two kinds of nodes: subpattern nodes whose first element is not None (the capturing groups) and groupref nodes (the backreferences).
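To make the tree walk concrete, here is a minimal sketch for Python 3. The helper names `iter_nodes`, `_nested`, and `find_groups` are made up for this illustration. It assumes the parser is importable either as the private `re._parser` / `re._constants` modules (Python 3.11+) or as the older top-level `sre_parse` / `sre_constants` modules. These are internal, undocumented modules, so the node layout may change between versions; in Python 3 the node types are constants such as `SUBPATTERN` and `GROUPREF` rather than the strings shown in the session above.

```python
try:
    # Python 3.11+: the parser lives in private submodules of `re`
    from re import _parser as sre_parse, _constants as sre_constants
except ImportError:
    # Older versions expose the same code as top-level modules
    import sre_parse
    import sre_constants


def iter_nodes(pattern):
    """Yield every (op, argument) node of a parsed pattern, depth first."""
    for op, av in pattern.data:
        yield op, av
        yield from _nested(av)


def _nested(av):
    """Descend into any sub-patterns contained in a node's argument."""
    if isinstance(av, sre_parse.SubPattern):
        yield from iter_nodes(av)
    elif isinstance(av, (tuple, list)):
        for item in av:
            yield from _nested(item)


def find_groups(regex):
    """Return (capturing group numbers, backreferenced group numbers)."""
    groups, backrefs = [], []
    for op, av in iter_nodes(sre_parse.parse(regex)):
        # The first element of a SUBPATTERN argument is the group number,
        # or None for a non-capturing group.
        if op == sre_constants.SUBPATTERN and av[0] is not None:
            groups.append(av[0])
        elif op == sre_constants.GROUPREF:
            backrefs.append(av)
    return groups, backrefs


print(find_groups(r"whate(ever)(?:\1)"))  # ([1], [1])
```

Because the walk recurses into every nested sub-pattern, groups inside alternations, repeats, and other groups are found too, while a parenthesis inside a character class such as [(bar)] is not reported.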
Solution 2:
Your regex is very naive. (In fact, I had a hard time finding a group construct that is matched by your regex.) To avoid false positives such as [(bar)], it's necessary to parse/match the entire pattern from left to right. I've come up with this regex:
^(?:\[(?:\\.|[^\]])*\]|\(\?:|[^(\\]|\\\D|\(\?[gmixsu])*$
Explanation:
^                 # start of string anchor
(?:               # this group matches a single valid expression:
                  # character classes:
  \[              # the opening [
  (?:
    \\.           # any escaped character
  |               # or
    [^\]]         # anything that's not a closing ]
  )*              # any number of times
  \]              # the closing ]
|
                  # non-capturing groups:
  \(\?:           # (?: literally
|
                  # normal characters (anything that's not a backslash or an opening parenthesis)
  [^(\\]
|
                  # meta sequences like \s
  \\\D
|
                  # inline modifiers like (?i)
  \(\?[gmixsu]
)*                # any number of valid expressions
$                 # end of string anchor
P.S.: This regex does not guarantee that the pattern is valid. (Compiling the pattern can still fail.)
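As a rough illustration of how this regex behaves, the sketch below compiles it with Python's `re` module and tries a few patterns. The name `is_group_free` is invented for this example. Reading the alternatives, the regex matches only when every opening parenthesis outside a character class starts `(?:` or an inline modifier and no backslash is followed by a digit, so a failed match signals a capturing group, a backreference, or some construct the regex does not list (lookarounds such as `(?=x)`, for instance).

```python
import re

# Solution 2's regex, copied verbatim into a raw string.
GROUP_FREE = re.compile(
    r"^(?:\[(?:\\.|[^\]])*\]|\(\?:|[^(\\]|\\\D|\(\?[gmixsu])*$"
)


def is_group_free(pattern):
    """True if the whole pattern consists only of constructs the regex allows."""
    return GROUP_FREE.match(pattern) is not None


for candidate in ["[(bar)]", "(?:foo)+", r"\s+(?i)", "(foo)", r"a\1"]:
    print(candidate, is_group_free(candidate))
```

The first three candidates are accepted; `(foo)` and `a\1` are rejected because of the capturing group and the backreference, respectively.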