
Scalable Solution For Str.contains With List Of Strings In Pandas

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference li

Solution 1:

For a scalable solution, do the following -

  1. join the contents of words by the regex OR pipe |
  2. pass this to str.contains
  3. use the result to filter df1

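To select the first column, avoid df1[0]: it looks up a column labelled 0 rather than the column at position 0, so it is ambiguous and raises a KeyError when no such label exists. It is better to use iloc (by position) or loc (by label), as in the code below.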
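To make the label-versus-position distinction concrete, here is a minimal sketch. The frame and its "text" column label are hypothetical, chosen only so that no column is labelled 0.

```python
import pandas as pd

# Hypothetical frame whose only column is labelled "text", so there is no label 0.
df1 = pd.DataFrame({"text": ["keep this row", "words to drop"]})

first_by_position = df1.iloc[:, 0]   # first column, whatever its label is
first_by_label = df1.loc[:, "text"]  # the column whose label is "text"

# df1[0] is a label lookup, so it fails here because no column is labelled 0.
try:
    df1[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
```

If the columns happen to be labelled 0, 1, 2, ... the two lookups agree, which is why df1[0] often appears to work.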

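import re
words = ["words", "to", "remove"]
# Escape each keyword so regex metacharacters (., +, etc.) are matched literally
pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, words)))
mask = df1.iloc[:, 0].str.contains(pattern, na=False)  # na=False: missing values count as non-matches
df1 = df1[~mask]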

Note: This will also work if words is a Series.
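As a self-contained illustration of the three steps above, the sketch below builds a small made-up frame (the sample rows are assumptions, not data from the question) and filters it. It escapes the keywords with re.escape and passes na=False so that missing values are kept rather than breaking the boolean mask.

```python
import re
import pandas as pd

# Made-up sample data; the None row stands in for a missing value.
df1 = pd.DataFrame({0: ["please remove me", "nothing here", "tomato soup", None]})
words = ["words", "to", "remove"]

# Step 1: join the (escaped) keywords with the regex OR pipe, wrapped in word boundaries.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, words)))

# Step 2: build a boolean mask; na=False treats missing values as non-matches.
mask = df1.iloc[:, 0].str.contains(pattern, na=False)

# Step 3: keep only the rows that did not match.
filtered = df1[~mask]
```

Because of the \b word boundaries, "tomato soup" is kept even though it contains the letters "to"; only whole-word matches are dropped.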


Alternatively, if your 0 column is a column of words only (not sentences), then you can use Series.isin, which tests for exact matches rather than substrings and should be faster:

df1 = df1[~df1.iloc[:, 0].isin(words)]
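A minimal sketch of the isin variant on made-up data follows. Note that isin compares whole cell values, so a cell that merely contains a keyword is not removed.

```python
import pandas as pd

# Made-up sample data: three single-word cells and one two-word cell.
df1 = pd.DataFrame({0: ["words", "keep", "to", "remove me"]})
words = ["words", "to", "remove"]

# Drop rows whose first-column value is exactly one of the keywords.
filtered = df1[~df1.iloc[:, 0].isin(words)]
```

Here "remove me" survives because it is not exactly equal to "remove"; use the str.contains approach if such rows should be dropped too.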
