
Pandas Find Duplicates In Cross Values

I have a dataframe and want to eliminate duplicate rows that have the same values, but in different columns:

df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

Solution 1:

I think you need to filter by boolean indexing with a mask created by numpy.sort combined with duplicated; to invert the mask, use ~:

df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
   a  b  c  d
1  x  y  e  f
3  w  v  s  t

Detail:

print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
 ['e' 'f' 'x' 'y']
 ['s' 't' 'v' 'w']]

print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w

print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool

print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
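Putting the pieces above together, a minimal self-contained run of this solution (using the sample data from the question) looks like:

```python
import numpy as np
import pandas as pd

# Sample frame from the question: rows 1 and 2 hold the same
# four values, just spread across different columns
df = pd.DataFrame(
    {'a': ['x', 'e', 'w'], 'b': ['y', 'f', 'v'],
     'c': ['e', 'x', 's'], 'd': ['f', 'y', 't']},
    index=['1', '2', '3'])

# Sort each row's values so cross-column duplicates line up,
# then keep only rows whose sorted values were not seen before
out = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print(out)
```

Because duplicated() keeps the first occurrence by default, row '1' survives and row '2' is dropped.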

Solution 2:

Here's another solution, with a for loop:

data = df.to_numpy()  # df.as_matrix() was removed in pandas 1.0
new = []

for row in data:
    if not new:
        new.append(row)
    else:
        if not any(c in nrow for nrow in new for c in row):
            new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)
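Run end to end on the question's sample data, this approach can be exercised like so (a sketch; note that, unlike Solution 1, it discards a row if *any* of its values appears in an already-kept row, not only when all four values match):

```python
import pandas as pd

df = pd.DataFrame(
    {'a': ['x', 'e', 'w'], 'b': ['y', 'f', 'v'],
     'c': ['e', 'x', 's'], 'd': ['f', 'y', 't']},
    index=['1', '2', '3'])

data = df.to_numpy()  # as_matrix() was removed in pandas 1.0
new = []
for row in data:
    if not new:
        new.append(row)
    # keep the row only if none of its values occurs in any kept row
    elif not any(c in nrow for nrow in new for c in row):
        new.append(row)

new_df = pd.DataFrame(new, columns=df.columns)
print(new_df)
```

The result also has a fresh default integer index, since the original row labels are lost when iterating over the raw array.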

Solution 3:

Sort each row with np.sort, flag the repeated rows with .duplicated(), and then use that boolean mask with df.drop to remove the duplicated index labels:

import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
df = df.drop(df.index[df_duplicated])
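As a variation on the same idea, the row values can be compared as unordered sets instead of sorted arrays (a sketch, not from the original answers; note that a frozenset ignores repeated values within a single row, which np.sort does not):

```python
import pandas as pd

df = pd.DataFrame(
    {'a': ['x', 'e', 'w'], 'b': ['y', 'f', 'v'],
     'c': ['e', 'x', 's'], 'd': ['f', 'y', 't']},
    index=['1', '2', '3'])

# Reduce each row to its set of values; duplicated() then flags
# rows whose value set has already been seen
mask = df.apply(frozenset, axis=1).duplicated()
out = df[~mask]
print(out)
```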
