Merge Pandas Dataframe With Key Duplicates
I have two DataFrames, both of which have a key column that may contain duplicates; the DataFrames mostly share the same duplicated keys. I'd like to merge these DataFrames on that key, bu
Solution 1:
Faster again:
# intended for a Cython cell in a Jupyter notebook
# (in another cell run `%load_ext Cython`); it also runs as plain Python
from collections import defaultdict
import numpy as np

def cg(x):
    # Yield the running occurrence number (1, 2, 3, ...) of each key.
    cnt = defaultdict(int)
    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]

def fastcount(x):
    return list(cg(x))

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)
# Merge on all shared columns (key and cc), then drop the helper column.
df1.merge(df2, how='outer').drop(columns='cc')
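Faster answer, but not scalable (it builds a boolean matrix with one row per unique key and one column per input row):
def fastcount(x):
    unq, inv = np.unique(x, return_inverse=True)
    # m[i, j] is True where row j holds the i-th unique key.
    m = np.arange(len(unq))[:, None] == inv
    # Running count along each key's row, kept only at that key's positions.
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)
df1.merge(df2, how='outer').drop(columns='cc')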
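As a sanity check, the counting helpers can be compared on a small array. This is a minimal sketch: the names `count_dict` and `count_numpy` and the sample keys are made up for illustration, and pandas' `groupby(...).cumcount()` is included as a baseline. The two helpers number occurrences from 1, while `cumcount` starts at 0. Either works for the merge as long as both frames use the same helper.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def count_dict(x):
    # Dictionary-based running count: 1 for the first occurrence of a key.
    cnt = defaultdict(int)
    out = []
    for j in x.tolist():
        cnt[j] += 1
        out.append(cnt[j])
    return out


def count_numpy(x):
    # Broadcasting-based running count; allocates a
    # (number of unique keys) x (number of rows) boolean matrix.
    unq, inv = np.unique(x, return_inverse=True)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)


# Hypothetical sample keys.
keys = np.array(['a', 'b', 'a', 'a', 'b', 'c'])

dict_counts = count_dict(keys)
numpy_counts = count_numpy(keys).tolist()
cumcount = pd.DataFrame({'key': keys}).groupby('key').cumcount().tolist()

print(dict_counts)
print(numpy_counts)
print(cumcount)
```

The boolean matrix in `count_numpy` grows with the product of unique keys and rows, which is the likely reason the answer calls that version not scalable.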
Old answer:
# Number each occurrence of a key (0, 1, 2, ...) so duplicates pair up one-to-one.
df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()
df1.merge(df2, how='outer').drop(columns='cc')
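To see what the occurrence-counter merge produces, here is a small self-contained sketch. The sample frames, the column names `value1` and `value2`, and the helper `merge_on_occurrence` are assumptions made for illustration; only the `key` column and the `cc` counter come from the answer above.

```python
import pandas as pd

# Hypothetical sample frames with partly matching duplicated keys.
df1 = pd.DataFrame({'key': ['a', 'a', 'b', 'c'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'value2': [10, 20, 30, 40]})


def merge_on_occurrence(left, right, key='key'):
    # Tag each row with its occurrence number within its key group,
    # so the n-th 'a' on the left pairs with the n-th 'a' on the right.
    left = left.assign(cc=left.groupby(key).cumcount())
    right = right.assign(cc=right.groupby(key).cumcount())
    # Outer merge keeps rows whose (key, cc) pair exists on one side only.
    return left.merge(right, on=[key, 'cc'], how='outer').drop(columns='cc')


merged = merge_on_occurrence(df1, df2)
print(merged)
```

Using `assign` leaves the input frames unmodified, unlike the in-place `df1['cc'] = ...` assignments above. Rows without a partner on the other side (the second `b` and the lone `c` here) are kept with missing values.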
Solution 2:
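import pandas as pd

df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
# Note: with duplicate keys, an index join pairs every left row
# with every right row that shares the key.
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
merged_df.reset_index('key', drop=False, inplace=True)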
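It may help to check how an index-based inner merge treats duplicated keys. The sketch below uses made-up frames (`left`, `right`, `value1` and `value2` are hypothetical names). A key that appears twice on both sides yields four rows, one per combination, rather than two one-to-one pairs.

```python
import pandas as pd

# Hypothetical sample frames; 'a' is duplicated on both sides.
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'value1': [1, 2, 3]}).set_index('key')
right = pd.DataFrame({'key': ['a', 'a', 'b'], 'value2': [10, 20, 30]}).set_index('key')

joined = pd.merge(left, right, left_index=True, right_index=True, how='inner')
joined = joined.reset_index()

# Collect the (key, value1, value2) combinations the join produced.
pairs = sorted(zip(joined['key'], joined['value1'], joined['value2']))
print(joined)
```

If one-to-one pairing of duplicates is the goal, an occurrence counter such as the `cc` column from Solution 1 would need to be part of the join key.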