Scalable Solution For Str.contains With List Of Strings In Pandas
I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference li
Solution 1:
For a scalable solution, do the following:
- join the contents of words with the regex OR pipe (|)
- pass the resulting pattern to str.contains
- use the result to filter df1
To index the 0 column, don't use df1[0] (as this might be considered ambiguous). It would be better to use loc or iloc (see below).
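import re
words = ["words", "to", "remove"]
# re.escape keeps regex metacharacters in the words literal; na=False treats missing values as non-matches
pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, words)))
mask = df1.iloc[:, 0].str.contains(pattern, na=False)
df1 = df1[~mask]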
Note: This will also work if words is a Series.
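To see the word-boundary filter end to end, here is a minimal self-contained sketch. The sample rows and the extra keyword "3.5" are made up for illustration, and the use of re.escape and na=False is an assumption about what you want (keywords matched literally, missing values kept), not something the question requires.

```python
import re
import pandas as pd

# Hypothetical sample data; the text lives in positional column 0.
df1 = pd.DataFrame({0: [
    "nothing here",          # no keyword
    "some words matter",     # contains "words"
    "stop the press",        # "to" appears only inside "stop"
    "go to town",            # contains "to" as a whole word
    "build 315",             # would match an unescaped "3.5"
    "version 3.5 released",  # contains the literal "3.5"
    None,                    # missing value
]})
words = ["words", "to", "remove", "3.5"]

# Escape each keyword, join with the regex OR pipe, and wrap the
# alternation in \b so only whole-word occurrences match.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, words)))

# na=False makes missing values count as "no match", so ~mask stays boolean.
mask = df1.iloc[:, 0].str.contains(pattern, na=False)
filtered = df1[~mask]

print(filtered)
```

The non-capturing group (?:...) keeps the alternation inside a single pair of \b anchors, so "stop the press" survives even though it contains the letters "to".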
Alternatively, if your 0 column is a column of words only (not sentences), then you can use Series.isin, which should be faster because it tests each value for exact membership instead of running a regex:
df1 = df1[~df1.iloc[:, 0].isin(words)]
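As a rough illustration of the exact-match variant, the sketch below uses made-up single-word rows. Note that isin compares whole cell values, so a cell such as "stopwords" is kept even though it contains a keyword as a substring.

```python
import pandas as pd

# Hypothetical sample data: each cell holds a single word.
df1 = pd.DataFrame({0: ["keep", "to", "remove", "stopwords", "words"]})
words = ["words", "to", "remove"]

# isin checks each cell for exact membership in words (no regex involved).
exact = df1[~df1.iloc[:, 0].isin(words)]

print(exact)
```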