Removing Outliers In Each Column (and Corresponding Row)

My NumPy array contains 10 columns and around 2 million rows. I need to analyze each column separately, find the values that are outliers, and delete the entire corresponding row.

Solution 1:

Two very straightforward approaches, the second with a little more sophistication:

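import numpy as np

arr = np.random.randn(int(2e6), 10)  # shape arguments must be integers

def remove_outliers(arr, k):
    # Keep rows where every column is within k sample std devs of its column mean.
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    return arr[np.all(np.abs((arr - mu) / sigma) < k, axis=1)]

def remove_outliers_bis(arr, k):
    # Same result, but each column is tested only on the rows still kept.
    mask = np.ones((arr.shape[0],), dtype=bool)  # np.bool was removed in NumPy 1.24
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    for j in range(arr.shape[1]):
        col = arr[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < k
    return arr[mask]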

Performance depends on how many outliers you have:

In [38]: %timeit remove_outliers(arr, 1)
1 loops, best of 3: 1.13 s per loop

In [39]: %timeit remove_outliers_bis(arr, 1)
1 loops, best of 3: 983 ms per loop

In [40]: %timeit remove_outliers(arr, 2)
1 loops, best of 3: 1.21 s per loop

In [41]: %timeit remove_outliers_bis(arr, 2)
1 loops, best of 3: 1.51 s per loop

And of course:

In [42]: np.allclose(remove_outliers(arr, 1), remove_outliers_bis(arr, 1))
Out[42]: True

In [43]: np.allclose(remove_outliers(arr, 2), remove_outliers_bis(arr, 2))
Out[43]: True

I would say that the added complexity of the second method does not justify its potential speed-up, but YMMV...

Solution 2:

The best-performing solution depends on the relative cost of finding an outlier and of deleting a row, and on the frequency of outliers.

If your outlier frequency is not very high, I would do as follows:

  • create a boolean table of outliers (one element for each element in the original table)
  • sum the table along axis 1 (one sum per row)
  • create a new table containing only the rows whose outlier sum is 0

Deleting rows one by one takes a lot of time, and if outlier detection is not very expensive, the extra work of possibly finding several outliers in the same row is not significant.

In code, this would be something like:

outliers = find_outliers(data)
data_without_outliers = data[outliers.sum(axis=1) == 0]

where find_outliers creates a boolean table of outlier status (i.e. True if the corresponding element in the original array data is an outlier).
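
The answer leaves find_outliers unspecified. As a minimal sketch, assuming a per-column z-score criterion similar to the one in Solution 1, it might look like the following. The function body, the default threshold k=3.0, and the synthetic test data are all illustrative assumptions, not part of the original answer:

```python
import numpy as np

def find_outliers(data, k=3.0):
    """Hypothetical helper: True where an element lies k or more sample
    standard deviations away from its column mean."""
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)
    return np.abs(data - mu) >= k * sigma

# Small synthetic array (assumed data, for illustration only).
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 10))
data[5, 2] = 50.0  # plant an obvious outlier in row 5

outliers = find_outliers(data)                           # boolean, same shape as data
data_without_outliers = data[outliers.sum(axis=1) == 0]  # keep rows with no flags
```

The mean and standard deviation are computed once on the full array, so a row is dropped if any of its ten values is flagged, which matches the three steps listed above.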

My guess is that the performance depends on your outlier-detection algorithm. If you can make it simple and vectorized, then this is fast.
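
To illustrate what "simple and vectorized" could mean here, this is a sketch of one possible detector based on per-column interquartile-range fences. The name find_outliers_iqr, the whisker factor of 1.5, and the sample data are assumptions chosen for the example; the original answer does not prescribe any particular criterion:

```python
import numpy as np

def find_outliers_iqr(data, whisker=1.5):
    """Hypothetical detector: flag values outside [Q1 - w*IQR, Q3 + w*IQR],
    with the quartiles computed separately for each column."""
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    return (data < q1 - whisker * iqr) | (data > q3 + whisker * iqr)

# Small synthetic array (assumed data, for illustration only).
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 10))
data[7, 4] = 100.0  # plant an obvious outlier in row 7

mask = find_outliers_iqr(data)
cleaned = data[~mask.any(axis=1)]  # same rows as mask.sum(axis=1) == 0
```

The whole computation is a handful of array operations with no Python-level loop over rows, so its cost should scale roughly with the size of the array rather than with the number of outliers.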

