
Search For Text Contained In Any Row Of A Pandas DataFrame

I have the following DataFrame:

pred[['right_context', 'PERC']]
Out[247]:
                          right_context      PERC
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197
1

Solution 1:

Series.str.contains & str.upper

You can use Series.str.contains and join the Address column of _direcciones into one string with | as the separator, so the resulting pattern matches any of the addresses.

Note that the text in pred has to be converted to uppercase with str.upper first, because str.contains is case-sensitive by default and addresses such as SAN PEDRO are stored in uppercase.

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))

print(pred)
                          right_context      PERC  found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197   False
1                San Pedro xxxxxxxxxxxx  0.572630    True
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630   False
3             de San Pedro Este parcela  0.572630    True
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577   False
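To illustrate why the upper-casing step matters, here is a minimal sketch on made-up data (the `pred` and `addresses` values below are hypothetical stand-ins, not the asker's real frames). It compares a search on the raw text with a search on the upper-cased text.

```python
import pandas as pd

# Hypothetical sample data mirroring the shapes used above.
pred = pd.DataFrame(
    {"right_context": ["San Pedro xxxx", "zxxxx", "de San Pedro Este parcela"]}
)
addresses = pd.Series(["SAN PEDRO", "TORRES"])

# One alternation pattern: matches if any address occurs in the text.
pattern = "|".join(addresses)

# Case-sensitive search on the raw text: "San Pedro" is not "SAN PEDRO".
raw_match = pred["right_context"].str.contains(pattern)

# Upper-casing the searched text first makes both sides comparable.
upper_match = pred["right_context"].str.upper().str.contains(pattern)

print(raw_match.tolist())
print(upper_match.tolist())
```

This only works if every address is itself stored in uppercase. An entry containing lowercase letters could never match text that has been upper-cased.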

To get only T and F instead of True and False:

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))\
                                      .astype(str).str[:1]

print(pred)
                          right_context      PERC found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197      F
1                San Pedro xxxxxxxxxxxx  0.572630      T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630      F
3             de San Pedro Este parcela  0.572630      T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577      F

Output of '|'.join

'|'.join(_direcciones['Address'])

'SAN PEDRO|bbbbbbbbbbbbbbbbbbbbbb|yyyyyyyyyyyyyyyyyyy'
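Because the joined string is interpreted as a regular expression, addresses containing characters such as `.`, `(` or `+` may match differently than intended. A possible safeguard, sketched here on hypothetical data, is to pass each address through `re.escape` before joining.

```python
import re

import pandas as pd

# Hypothetical addresses; "AV. (NORTE)" contains regex metacharacters.
addresses = pd.Series(["SAN PEDRO", "AV. (NORTE)"])
text = pd.Series(["SAN PEDRO XXXX", "AV. (NORTE) 12", "AVX NORTE"])

# Escape each address so it is matched literally, then join with "|".
safe_pattern = "|".join(re.escape(a) for a in addresses)

# Only rows containing one of the literal addresses are flagged.
found = text.str.contains(safe_pattern)
print(found.tolist())
```

Without escaping, the `.` in the second address would match any character and the parentheses would be read as a group, so "AVX NORTE" would be flagged while the literal "AV. (NORTE)" would not.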

Solution 2:

Use word boundaries with all strings joined by | with Series.str.contains and parameter case=False:

pat = '|'.join(r"\b{}\b".format(x) for x in _direcciones['entity_content'])
pred['found?'] = pred['right_context'].str.contains(pat, case=False)
print(pred)
                          right_context      PERC  found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197   False
1                San Pedro xxxxxxxxxxxx  0.572630    True
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630   False
3             de San Pedro Este parcela  0.572630    True
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577   False

If you need T and F instead of booleans, add numpy.where:

import numpy as np

pat = '|'.join(r"\b{}\b".format(x) for x in _direcciones['entity_content'])
pred['found?'] = np.where(pred['right_context'].str.contains(pat, case=False), 'T', 'F')
print(pred)
                          right_context      PERC found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197      F
1                San Pedro xxxxxxxxxxxx  0.572630      T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630      F
3             de San Pedro Este parcela  0.572630      T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577      F
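The effect of the word boundaries can be seen in a small sketch. The data is hypothetical, and `na=False` is an extra assumption added here so that missing values come back as False rather than NaN.

```python
import pandas as pd

# Hypothetical data: "Pedrosa" embeds "Pedro" without being the same word.
text = pd.Series(["San Pedro Este", "San Pedrosa", None])
addresses = ["SAN PEDRO"]

# Same alternation, once without and once with \b word boundaries.
plain = "|".join(addresses)
bounded = "|".join(r"\b{}\b".format(a) for a in addresses)

# case=False ignores case; na=False maps missing values to False.
plain_match = text.str.contains(plain, case=False, na=False)
bounded_match = text.str.contains(bounded, case=False, na=False)

print(plain_match.tolist())
print(bounded_match.tolist())
```

The plain pattern also flags "San Pedrosa", while the bounded pattern only flags rows where the address appears as whole words.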

Solution 3:

Try this approach; it works for me on a small data sample:

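import re
from pprint import pprint
import numpy as np
import pandas as pd

def main():
    # Sample data (split(',') keeps the leading spaces of each item)
    df_right = pd.DataFrame({'right_context':'San Jose, San Pedro, San Pedro Este, Santani, Honolulu'.split(','),
                       'PERC': np.arange(5)})
    directions = pd.DataFrame({'address':'SAN PEDRO, Djiloboji, Torres'.split(','),
                       'value': np.arange(3)})
    # Build one pattern from all addresses: strip spaces, escape regex characters
    pattern = '|'.join(re.escape(a.strip()) for a in directions['address'])
    # Case-insensitive search of every row for any of the addresses
    found = df_right['right_context'].str.contains(pattern, case=False).tolist()
    # Insert the result into the original DataFrame as the third column
    df_right.insert(2, "found", found)
    pprint(df_right)

if __name__ == "__main__":
    main()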

Output:

     right_context  PERC  found
0         San Jose     0  False
1        San Pedro     1   True
2   San Pedro Este     2   True
3          Santani     3  False
4         Honolulu     4  False
