Find Lots Of String In Text - Python
Solution 1:
A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:
(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
It might be a good idea to add '\b' word boundaries around the pattern, to avoid matching "have a tea" inside "have a team".
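A quick way to see what the word boundaries change is to run the same search with and without them. This is only an illustrative sketch; the sample text is made up for the example.

```python
import re

# Made-up sample text containing "have a team", which should NOT count as a hit
text = "we will have a team meeting"

# Without boundaries, the sentence matches inside the longer phrase
without_boundaries = re.findall(r"have a tea", text)

# With \b on both sides, the match must start and end at a word boundary
with_boundaries = re.findall(r"\bhave a tea\b", text)

print(without_boundaries)  # ['have a tea']
print(with_boundaries)     # []
```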
You'll need a small Trie script. It's not an official package yet, but you can simply download it here and save it as trie.py in your current directory.
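If the script is not at hand, the idea can be sketched in a few lines. The class below is a hypothetical stand-in, not the script linked above: it only assumes the two methods used in this answer, add() and pattern(), and it skips optimizations such as character classes. Depending on the Python version, re.escape may or may not put a backslash before spaces, so the generated pattern can differ cosmetically from the one shown above.

```python
import re


class Trie:
    """Hypothetical minimal trie that can render itself as a regex pattern."""

    def __init__(self):
        # Each node is a dict mapping one character to a child node.
        # The empty-string key marks that a stored sentence ends at this node.
        self.root = {}

    def add(self, sentence):
        node = self.root
        for char in sentence:
            node = node.setdefault(char, {})
        node[""] = True  # end-of-sentence marker

    def _pattern(self, node):
        ends_here = "" in node
        # One regex branch per outgoing character, built recursively
        branches = [
            re.escape(char) + self._pattern(child)
            for char, child in sorted(node.items())
            if char != ""
        ]
        if not branches:
            return ""  # leaf: nothing left to match
        if len(branches) == 1 and not ends_here:
            return branches[0]  # a plain chain needs no group
        body = "(?:" + "|".join(branches) + ")"
        # If a sentence also ends here, the continuation is optional
        return body + "?" if ends_here else body

    def pattern(self):
        return self._pattern(self.root)


trie = Trie()
for sentence in ["bla bla", "have a tea", "hy i m luca",
                 "i love android", "i love ios"]:
    trie.add(sentence)

regex = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
found = regex.findall("i love android and i think i will have a tea with john")
print(found)  # ['i love android', 'have a tea']
```

Shared prefixes such as "i love " are emitted only once, which is what keeps the generated pattern compact compared with a flat alternation of all sentences.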
You can then use this code to generate the trie/regex:
import re
from trie import Trie
to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
]
trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)
print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'
print(re.findall(pattern, text))
# ['i love android', 'have a tea']
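You invest some time up front to create the Trie and compile the regex, but the matching itself should be very fast: the regex engine can rule out whole branches after a character or two instead of trying every sentence in turn.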
Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.
Note that it wouldn't find overlapping sentences:
to_find_sentences = [
    'i love android',
    'android Marshmallow'
]
# ...
print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']
You'd have to modify the regex with positive lookaheads to find overlapping sentences.
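One possible way to do that, as a rough sketch: wrap the whole alternation in a lookahead with a capturing group, so that each match consumes no text and the next search can start inside the previous sentence. For simplicity this example uses a plain alternation built with re.escape rather than the trie pattern; a trie pattern should work the same way as long as it contains only non-capturing groups.

```python
import re

to_find_sentences = [
    'i love android',
    'android Marshmallow',
]

# Longest sentences first, so the longer one wins when two start at the same spot
alternation = "|".join(
    re.escape(s) for s in sorted(to_find_sentences, key=len, reverse=True)
)

# The lookahead matches an empty string; the sentence itself lands in group 1
overlapping = re.compile(r"(?=\b(" + alternation + r")\b)", re.IGNORECASE)

matches = overlapping.findall("I love android Marshmallow")
print(matches)  # ['I love android', 'android Marshmallow']
```

Note that this still reports at most one sentence per starting position, so two sentences that begin at the same character would need separate handling.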