Skip to content Skip to sidebar Skip to footer

Merge Pandas Dataframe With Key Duplicates

I have 2 dataframes, both have a key column which could have duplicates, but the dataframes mostly have the same duplicated keys. I'd like to merge these dataframes on that key, bu

Solution 1:

faster again

%%cython
# using cython in jupyter notebook
# in another cell run `%load_ext Cython`
from collections import defaultdict
import numpy as np

def cg(x):
    cnt = defaultdict(lambda: 0)

    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]


def fastcount(x):
    return [i for i in cg(x)]

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop('cc', 1)

faster answer; not scalable

def fastcount(x):
    unq, inv = np.unique(x, return_inverse=1)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop('cc', 1)

old answer

df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()

df1.merge(df2, how='outer').drop('cc', 1)

enter image description here


Solution 2:

df1.set_index('key', inplace=True)

df2.set_index('key', inplace=True)

merged_df = pd.merge(df1, df2, left_index = True, right_index = True, how= 'inner')
merged_df.reset_index('key', drop=False, inplace=True)

Post a Comment for "Merge Pandas Dataframe With Key Duplicates"