Trypsin Digest (cleavage) Does Not Work Using Regular Expression

I am trying to code the theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P. (i.e. the trypsin cleaves (

Solution 1:

Regexes are nice, but here's a solution that uses plain Python. Since you're looking for subsequences in the bases, it makes sense to build this as a generator that yields the fragments.

example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'

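def trypsin(bases):
    sub = ''
    while bases:
        # str.find returns -1 when the residue is absent (0 is a valid index).
        k, r = bases.find('K'), bases.find('R')
        if k < 0 and r < 0:
            # No cleavage site left: the remainder is the final fragment.
            cut = len(bases)
        elif k >= 0 and r >= 0:
            cut = min(k, r) + 1
        else:
            cut = max(k, r) + 1
        sub += bases[:cut]
        bases = bases[cut:]
        # Emit the fragment unless the next residue is P (no cleavage before P).
        if not bases or bases[0] != 'P':
            yield sub
            sub = ''

print(list(trypsin(example)))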

Solution 2:

EDIT: With a slight modification, your regex works well:

In your comment you mentioned you have multiple sequences in a file (let's call it sequences.dat):

$ cat sequences.dat
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

>>> with open('sequences.dat') as f:
    s = f.read()

>>> print(s)
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

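>>> import re
>>> # The fourth positional argument of re.sub is count, so flags (none are needed here) must be passed as flags=...
>>> protein = re.sub(r'(?<=[RK])(?=[^P])', '\n', s)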

>>> protein.split()
['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']

>>> print(protein)
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
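
If the fragments need to stay grouped by sequence, the same lookaround pattern can be applied to each line separately instead of to the whole file contents. The following is only a minimal sketch: the names `CLEAVAGE_SITE` and `digest_lines` are made up for illustration, and it assumes Python 3.7 or later, where `split` accepts patterns that match the empty string.

```python
import re

# Zero-width split point: preceded by R or K and followed by a residue other than P.
CLEAVAGE_SITE = re.compile(r'(?<=[RK])(?=[^P])')

def digest_lines(lines):
    """Return one list of tryptic fragments per non-empty input line."""
    digests = []
    for line in lines:
        seq = line.strip()  # drop the trailing newline and surrounding whitespace
        if seq:
            digests.append(CLEAVAGE_SITE.split(seq))
    return digests

# Any iterable of lines works here, including an open file object.
sequences = [
    'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK\n',
    'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK\n',
]
for fragments in digest_lines(sequences):
    print(fragments)
```

Because the lookahead requires a following residue, nothing is split off after a K or R at the very end of a line, so no empty fragments appear.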

Solution 3:

I believe the following regexp will do as you have described:

([KR]?[^P].*?[KR](?!P))

The result below is from pythonregexp:

>>> regex = re.compile("([KR]?[^P].*?[KR](?!P))")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xb1a9f49eb4111980>
>>> regex.match(string)
<_sre.SRE_Match object at 0xb1a9f49eb4102980>

# List the groups found
>>> r.groups()
(u'MVPPPPSR',)

# List the named dictionary objects found
>>> r.groupdict()
{}

# Run findall
>>> regex.findall(string)
[u'MVPPPPSR', u'GGAAKPGQLGR', u'SLGPLLLLLRPEEPEDGDR', u'EICSESK']
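
One way to sanity-check this pattern is to compare it with a plain-loop digest on a couple of inputs. The sketch below is illustrative only: `reference_digest` and the second test sequence `AAKRGG` are made up for this check and are not part of the answer above. It assumes that single-residue fragments and a tail without a terminal K or R should also be reported.

```python
import re

# The pattern proposed in this answer.
PATTERN = re.compile(r"([KR]?[^P].*?[KR](?!P))")

def reference_digest(seq):
    """Cut after every K or R unless the next residue is P (plain loop, no regex)."""
    fragments, start = [], 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1:i + 2]  # '' at the end of the sequence
        if residue in 'KR' and next_residue != 'P':
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])  # tail that does not end in K or R
    return fragments

for seq in ('MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK', 'AAKRGG'):
    # Prints whether the regex and the loop agree for this sequence.
    print(seq, PATTERN.findall(seq) == reference_digest(seq))
```

The two approaches agree on the example sequence. On `AAKRGG` the pattern returns only the first fragment, while the loop also reports `R` and `GG`, so the pattern may need adjusting if such sequences occur in your data.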
