
Translate the Intent of This PHP Regex for Multiline Strings into Python/Perl

Below is a PHP regex intended to match (multiline) strings inside PHP or JavaScript source code (from this post), but I suspect it's got issues. What is the literal Python (or else

Solution 1:

The \\. in the pattern is meant to match a literal backslash and swallow the character that follows it. Note that since patterns in PHP (and Python) are written as string literals, it actually needs to be \\\\. in an ordinary string literal, so that it ends up as \\. by the time it reaches the regex engine.
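To make the two levels of escaping concrete, here is a minimal Python sketch. The names escaped, raw, and sample are illustrative only and do not come from the original question.

```python
import re

# Ordinary string literal: each pair of backslashes collapses to one,
# so four backslashes in the source become two in the pattern.
escaped = '\\\\.'
# Raw string literal: backslashes are kept exactly as written.
raw = r'\\.'

print(escaped == raw)   # both are the three-character pattern \\.
print(len(raw))

# The pattern matches a literal backslash plus whatever character follows it.
sample = r'say \"hi\"'
found = re.findall(raw, sample)
print(len(found))       # one match per backslash-escaped quote
```

Either spelling hands the regex engine the same three characters; the raw form is simply easier to read.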

It's important to match the backslash and swallow the following character because it could be used to escape a quote which would otherwise end the match prematurely.

This pattern looks like it should work fine, and I can't think of a more succinct way to express it.

It should also work fine in Python (as you say, with re.DOTALL). In Python you could use the raw string notation to save the extra escaping of the backslash although you'd still need to escape the single quote. This should be equivalent:

re.search(r'\'(\\.|[^\'])*\'|"(\\.|[^"])*"', text, re.DOTALL)

(Here text is the string being searched; naming it str would shadow Python's built-in type.)
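As a rough usage sketch, the pattern above can be compiled once and run over a piece of source text. The sample text and the names STRING_RE, source, and matches are made up for illustration.

```python
import re

# Solution 1's pattern, compiled once; re.DOTALL lets "." match a newline too.
STRING_RE = re.compile(r'\'(\\.|[^\'])*\'|"(\\.|[^"])*"', re.DOTALL)

# Hypothetical source text: a two-line double-quoted string with escaped
# quotes, followed by a single-quoted string with an escaped apostrophe.
source = r'''x = "line one
line \"two\"";
y = 'it\'s';'''

# Collect the full text of every match (group 0), not the inner groups.
matches = [m.group(0) for m in STRING_RE.finditer(source)]
for literal in matches:
    print(repr(literal))
```

finditer with group(0) is used here because re.findall would return the contents of the capturing groups rather than the whole string literal.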

Solution 2:

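The regex is mostly okay, except it doesn't reliably handle escaped quotes (i.e., \" and \'): the character classes can also match a backslash, so when the engine backtracks it can treat an escaped quote as the closing one. That's easy enough to fix by excluding the backslash from each class: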

'(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*"

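That's a "generic" regex; in Python you would usually write it in the form of a raw string. Because the pattern itself ends with a double quote, triple single quotes are the safer delimiter here (with triple double quotes, the pattern's final " would merge with the closing delimiter and produce a syntax error):

r''''(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*"'''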
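The practical effect of excluding the backslash from the character classes shows up on input that is never properly closed. The following is a small sketch with made-up inputs; LOOSE is the pattern from Solution 1, STRICT is the one from this answer, and all names are illustrative.

```python
import re

# Solution 1's classes ([^'] and [^"]) can also match a backslash.
LOOSE = re.compile(r'\'(\\.|[^\'])*\'|"(\\.|[^"])*"', re.DOTALL)
# This answer's classes exclude the backslash.
STRICT = re.compile(r''''(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*"''', re.DOTALL)

# Hypothetical inputs: one well-formed literal, one that is never closed.
well_formed = r"'it\'s'"
unterminated = r"'abc\' and more"

# Both patterns agree on the well-formed literal.
print(LOOSE.search(well_formed).group(0) == STRICT.search(well_formed).group(0))

# On the unterminated input, LOOSE backtracks and accepts the escaped quote
# as the closing quote, while STRICT finds no match at all.
loose_hit = LOOSE.search(unterminated)
strict_hit = STRICT.search(unterminated)
print(repr(loose_hit.group(0)))
print(strict_hit)
```

Whether a silent mismatch or no match is preferable depends on what the surrounding tool does with malformed input; this only illustrates the difference in behavior.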

In PHP you have to escape the backslashes to get them past PHP's string processing:

'~\'(?:\\\\.|[^\'\\\\]+)*\'|"(?:\\\\.|[^"\\\\]+)*"~s'

Most of the currently-popular languages have either a string type that requires less escaping, support for regex literals, or both. Here's how your regex would look as a C# verbatim string:

@"'(?:\\.|[^'\\]+)*'|""(?:\\.|[^""\\]+)*"""

But, formatting considerations aside, the regex itself should work in any Perl-derived flavor (and many other flavors as well).


p.s.: Notice how I added the + quantifier to your character classes. Your intuition about matching one character at a time is correct; adding the + makes a huge difference in performance. But don't let that fool you; when you're dealing with regexes, intuition seems to be wrong more often than not. :/
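One way to check the performance remark on your own machine is to time both variants on the same input. This is only a sketch: the test string is invented, the names are illustrative, and the timings will vary by engine, version, and input, so no numbers are claimed here.

```python
import re
import timeit

# Same pattern for single-quoted strings, without and with the + quantifier.
ONE_AT_A_TIME = re.compile(r"'(?:\\.|[^'\\])*'", re.DOTALL)
IN_RUNS = re.compile(r"'(?:\\.|[^'\\]+)*'", re.DOTALL)

# Hypothetical test input: a long, properly terminated single-quoted string
# that ends with one escaped quote before the closing quote.
text = "'" + "lorem ipsum " * 1000 + r"\'" + "'"

# Both variants should match exactly the same text.
same = ONE_AT_A_TIME.search(text).group(0) == IN_RUNS.search(text).group(0)
print("same match:", same)

for name, rx in (("no +", ONE_AT_A_TIME), ("with +", IN_RUNS)):
    seconds = timeit.timeit(lambda: rx.search(text), number=200)
    print(f"{name}: {seconds:.4f}s for 200 searches")
```

The input here is deliberately well-formed; behavior on unterminated strings is a separate question and is not measured by this sketch.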
