Skip to content Skip to sidebar Skip to footer

Find Rows In A Dataframe That Partially Match Conditions

Given a DataFrame, whats the best way to find rows in the DataFrame that partially match a list of given values. Currently I have a rows of given values in a DataFrame (df1), I it

Solution 1:

Not sure if there's a way around looping over at least one DataFrame, but here's one option that might speed things up. It does allow for the accidental comparison of FirstName with LastName, though that can be avoided by adding a unique prefix to the values (like '@' for first name and '&' for last name)

import numpy as np

s1 = [set(x) for x in df1.values]
s2 = [set(x) for x in df2.values]
masks = np.reshape([len(x & y) >= 3for x in s1 for y in s2], (len(df1), -1))
concat_all = [df2[m] for m in masks]

Output concat_all

[  FirstName LastName  Birthday  ResidenceZip
 0      John      Doe  1/1/2000         99999
 1      John      Doe  1/1/2000         99999
 2      John     Doex  1/1/2000         99999,
   FirstName LastName  Birthday  ResidenceZip
 5       Rob        A  9/9/2009         19499]

Timings

defAlollz(df1, df2):
    s1 = [set(x) for x in df1.values]
    s2 = [set(x) for x in df2.values]
    masks = np.reshape([len(x & y) >= 3for x in s1 for y in s2], (len(df1), -1))
    concat_all = [df2[m] for m in masks]
    return concat_all

defSharpObject(df1, df2):
    concat_all = []
    for i, row in df1.iterrows():
        c = {'ResidenceZip': row['ResidenceZip'], 'FirstName':row['FirstName'], 
             'LastName': row['LastName'],'Birthday': row['Birthday']}
        df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis = 1)
        x1 = df2[df2['count']>=3]
        concat_all.append(x1)
    return concat_all

%timeit Alollz(df1, df2)
#785 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit SharpObject(df1, df2)
#3.56 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And larger:

# you should never append dfs like this in a loop
for i in range(7):
    df1 = df1.append(df1)
    df2 = df2.append(df2)

%timeit Alollz(df1, df2)
#132 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit SharpObject(df1, df2)
#6.88 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

Using the numpy isin function:

df1_vals = df1.values
df2_vals = df2.values
df1_rows = range(df1_vals.shape[0])

concat_all = \
    [df2[np.add.reduce(np.isin(df2_vals, df1_vals[row]), axis=1) >= 3] for row in df1_rows]

Here are the dataframes for setup:

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})

df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000', '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})

Post a Comment for "Find Rows In A Dataframe That Partially Match Conditions"