Pandas Find Duplicates In Cross Values
I have a dataframe and want to eliminate duplicate rows that have the same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
Solution 1:
I think you need to filter by boolean indexing, with the mask created by numpy.sort and duplicated; to invert it, use ~:
df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
a b c d
1 x y e f
3 w v s t
Detail:
print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
 ['e' 'f' 'x' 'y']
 ['s' 't' 'v' 'w']]
print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2 e f x y
3 s t v w
print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool

print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
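If you prefer to stay inside pandas without the NumPy round-trip, here is a minimal sketch of the same idea (my own variant, not from the answer): sort each row's values into a tuple and mark the repeats. It assumes the values within a row are mutually comparable, which holds for the all-string example here.
# variant sketch: row-wise sorted tuples instead of np.sort
mask = ~df.apply(lambda row: tuple(sorted(row)), axis=1).duplicated()
print(df[mask])
#    a  b  c  d
# 1  x  y  e  f
# 3  w  v  s  t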
Solution 2:
Here's another solution, with a for loop:
data = df.to_numpy()  # df.as_matrix() was removed in pandas 1.0
new = []
for row in data:
    if not new:
        new.append(row)
    else:
        # keep the row only if none of its values already appear in a kept row
        if not any([c in nrow for nrow in new for c in row]):
            new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)
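Applied to the example frame above, this should keep only the first and third rows; since new_df is rebuilt from a plain list, it gets a fresh 0-based index:
print(new_df)
#    a  b  c  d
# 0  x  y  e  f
# 1  w  v  s  t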
Solution 3:
Use sorting (np.sort) and then find the duplicates (.duplicated()). Then use those duplicates to drop (df.drop) the corresponding index labels:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
index_to_drop = [ind for ind in range(len(df_duplicated)) if df_duplicated.iloc[ind]]
df = df.drop(df.index[df_duplicated])
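With the frame built above, the drop should leave only the first and third rows, matching the result of Solution 1:
print(df)
#    a  b  c  d
# 1  x  y  e  f
# 3  w  v  s  t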