
Python: Remove Duplicates From A Multi-dimensional Array

In Python, numpy.unique can remove all duplicates from a 1D array very efficiently. 1) How can I remove duplicate rows or columns in a 2D array? 2) What about nD arrays?

Solution 1:

If possible, I would use pandas.

In [1]: from pandas import DataFrame

In [2]: import numpy as np

In [3]: a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])

In [4]: DataFrame(a).drop_duplicates().values
Out[4]: 
array([[1, 1],
       [2, 3],
       [5, 4]], dtype=int64)
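If pandas is not an option, a NumPy-only variant may be enough. The sketch below assumes NumPy 1.13 or later, where np.unique accepts an axis argument; the variable names are only illustrative. Note that np.unique returns the rows sorted, so the second part uses return_index to recover first-occurrence order.

```python
import numpy as np

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])

# Unique rows (axis=0). The result is sorted lexicographically,
# not kept in the order of first appearance.
unique_rows = np.unique(a, axis=0)

# To keep first-occurrence order, ask for the index of each first
# occurrence and sort those indices before selecting rows.
c = a[::-1]  # reversed copy, so that sorted order and input order differ
_, first_idx = np.unique(c, axis=0, return_index=True)
rows_in_order = c[np.sort(first_idx)]

# Unique columns work the same way with axis=1.
unique_cols = np.unique(a.T, axis=1)

print(unique_rows)
print(rows_in_order)
print(unique_cols)
```

The same call with a different axis value should also cover the nD case, where each slice along that axis is compared as a whole.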

Solution 2:

The following is another approach, which performs much better than a for loop. In my test it took about 2 s for 10k + 100 duplicates.

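def tuples(A):
    # Recursively convert nested arrays to tuples so that rows become hashable.
    try:
        return tuple(tuples(a) for a in A)
    except TypeError:
        return A  # scalar: not iterable, keep as-is

b = set(tuples(a))  # a is the array to deduplicate; a set does not preserve order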

The idea was inspired by the first part of Waleed Khan's answer. It needs no additional package, and the helper may have further applications. It is also quite Pythonic, I guess.
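Since a set discards the original order, here is a minimal sketch of an order-preserving variant of the same idea. It assumes Python 3.7+ (where dict keeps insertion order) and numeric leaves; unique_rows is a hypothetical helper name, not part of the answer above.

```python
def tuples(A):
    """Recursively convert nested sequences into nested tuples (hashable)."""
    try:
        return tuple(tuples(a) for a in A)
    except TypeError:
        # A is a scalar (not iterable), so return it unchanged.
        # Caveat: strings are iterable and would recurse without end,
        # so this assumes the leaves are numbers.
        return A


def unique_rows(rows):
    """Drop duplicate rows, keeping the first occurrence of each (hypothetical helper)."""
    # dict.fromkeys keeps one key per distinct row, in insertion order.
    return list(dict.fromkeys(tuples(rows)))


a = [[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]]
print(unique_rows(a))  # [(1, 1), (2, 3), (5, 4)]
```

Because tuples() recurses, the same helper also works for deeper nesting, where each top-level element is compared as a whole.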

Solution 3:

The numpy_indexed package solves this problem for the n-dimensional case (disclaimer: I am its author). In fact, solving this problem was the motivation for starting the package, but it has since grown to include a lot of related functionality.

import numpy as np
import numpy_indexed as npi
a = np.random.randint(0, 2, (3, 3, 3))
print(npi.unique(a))
print(npi.unique(a, axis=1))
print(npi.unique(a, axis=2))
