Finding Character Position From A Regex'd String

June 17, 2023 Post a Comment

Say, I have 'ABBACDA', and I want to find the index of where each A is. Using Regex in Python, match = re.findall('A', 'ABBACDA') only returns a list. Any way I can tweak this? Or

Solution 1:

Use re.finditer()

example 1:

match = re.finditer('A', 'ABBACDA')
for m inmatch:
    print m.start(), m.end(), m.group(0)

output:

01A34A67A

example 2:

match = re.finditer('BB', 'ABBACDA')
for m inmatch:
    print m.start(), m.end(), m.group(0)

output :

1 3 BB

Solution 2:

To John Machin

« (1) in your "find" code, tof is "A", in your regex code, tof is "BB" »

No. There is a 'find' code with tof = 'A' , and there is a code comparing 'find' method and 'regex' method with tof = 'BB'. There' is not an alone 'regex' code with tof = 'BB'

The 'find' code presents this solution in an isolated manner to be clearer because in the second code its lisibility is obfsucated.

I put the definition of pat = re.compile(tof) out of the loops in order to not count the time of its creation in the measured time. As it is in the beginning of the second code, you believed that it is an entirely 'regex' code. That's not so. It's a comparison code.

You should read more thoroughly.

« (2) ch[prec:].find(tof) instead of ch.find(tof, prec) »

Yes.

In fact I already knew that the index must be put in find() and not used in the string. I found it and I already used that way in past codes, but I just had a blank in my brain.

By the way, the code is hence simpler. And secondly, as you pointed out, the corrected 'find' solution now runs faster. It runs in T seconds when the regex solution runs in 0.77 * T (0.65 * T before correction):

ch = 'jggBBjgjBBBjhgBBBBjjgBBBBBjjggBBBBBBjjjgjBBBBBBB'
tof = 'BB'
L = len(tof)

X,Y = [],[]
pat = re.compile(tof)

for essay in xrange(5):

    te = clock()
    for i in xrange(1000):
        li = []
        x = ch.find(tof)
        while x+1:
            li.append(x)
            x = ch.find(tof,x+L)
    X.append( clock()-te )


    te = clock()
    for i in xrange(1000):
        ly = [ m.start() for m in pat.finditer(ch) ]
    Y.append( clock()-te )

print li,'\n',ly,'\n'printmin(X),'\n',min(Y)

But it's not such a dramatic acceleration though: 18 % faster.

« (3) your find code considers overlapping matches, the regex code doesn't »

I don't think so. I precisely tried with 'ABBACDABBmmrteyBBgfrewBBBBBioBByt BBB ggddbBB BBbGtBBBGtbBbGT' containing 'BBBBB' to verify that the 'find' solution gives the same result as the 'regex' solution, because I do know that a regex doesn't return overlapping matches. I used L = len(tof) precisely to avoid detection of overlapping slices.

result of the above code:

[3, 8, 14, 16, 21, 23, 30, 32, 34, 41, 43, 45]
[3, 8, 14, 16, 21, 23, 30, 32, 34, 41, 43, 45]

So what does «Your find code considers overlapping matches» mean, please ?

« (4) etc etc »

That's unfair. If there are etceterae, tell which ones. If none... well, I don't know...

I think that a ratio of 1 right comment versus a false one, a dubious one and an unfair one isn't enough to downvote.

Moreover, it appears to me that on SO the imperfect answers written 1 or 2 hours after the question don't escape to downvotes, while the good similar ones are more rarely upvoted because the questionner have had his solution for a long time.

Solution 3:

If you are looking for only one result, use the re.search function instead. That returns a MatchObject (http://docs.python.org/library/re.html#re.MatchObject). The object has start() and end() functions.

Solution 4:

You could try timing the following to see if it is faster than the regex approach:

defmultifind(needle, haystack, overlap=False):
    delta = 1if overlap elsemax(1, len(needle))
    pos = 0
    find = haystack.find
    while1:
        pos = find(needle, pos)
        if pos < 0: returnyield pos
        pos += delta

>>> data = 'ABBACDABBmmrteyBBgfrewBBBBBioBBytA BBB ggdAdbBB BBbGtBBABGtbBbGT'>>> list(multifind('A', data))
[0, 3, 6, 33, 42, 55]
>>> list(multifind('BB', data))
[1, 7, 15, 22, 24, 29, 35, 45, 48, 53]
>>> list(multifind('BB', data, overlap=True))
[1, 7, 15, 22, 23, 24, 25, 29, 35, 36, 45, 48, 53]
>>> list(multifind('', 'qwerty'))
[0, 1, 2, 3, 4, 5, 6]
>>>

Solution 5:

Oh, I also knew the use of a generator, and I had forgotten its advantage in such a research. It's a fine remark, this one.

And the use of an alias hayfind = haystack.find to make the code faster is very tricky too.

Now the 'find with generator' method runs in T and 'regex' method runs in only 0.96 * T

ch = 'jggBBjgjBBBjhgBBBBjjgBBBBBjjggBBBBBBjjjgjBBBBBBB'
tof = 'BB'
L = len(tof)

X,Y = [],[]
pat = re.compile(tof)

def multifind(needle, haystack):
    delta = len(needle)
    pos = 0
    hayfind = haystack.findwhile1:
        pos = hayfind(needle, pos)
        if pos < 0:  returnyield pos
        pos += delta

for essay in xrange(20):

    te = clock()
    for i in xrange(10000):
        li = list(multifind(tof,ch))
    X.append( clock()-te )


    te = clock()
    for i in xrange(10000):
        ly = [ m.start() for m in pat.finditer(ch) ]
    Y.append( clock()-te )

print li,'\n',ly,'\n'printmin(X),'\n',min(Y)

The two methods are of equivalent speeds. But I finaly prefer the 'regex' solution because it can be modified to obtain more results if needed (start and end, other conditions...) while 'find' solution only give the positions.

Python Guru