Removing Outliers In Each Column (and Corresponding Row)
Solution 1:
Two very straightforward approaches, the second with a little more sophistication:
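import numpy as np

arr = np.random.randn(2_000_000, 10)  # dimensions must be integers, not 2e6

def remove_outliers(arr, k):
    # Keep rows where every column is within k standard deviations of its mean
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    return arr[np.all(np.abs((arr - mu) / sigma) < k, axis=1)]

def remove_outliers_bis(arr, k):
    # Same result, but each column is only tested on the rows still kept
    mask = np.ones((arr.shape[0],), dtype=bool)  # np.bool was removed from NumPy
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    for j in range(arr.shape[1]):
        col = arr[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < k
    return arr[mask]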
Performance depends on how many outliers you have:
In [38]: %timeit remove_outliers(arr, 1)
1 loops, best of 3: 1.13 s per loop
In [39]: %timeit remove_outliers_bis(arr, 1)
1 loops, best of 3: 983 ms per loop
In [40]: %timeit remove_outliers(arr, 2)
1 loops, best of 3: 1.21 s per loop
In [41]: %timeit remove_outliers_bis(arr, 2)
1 loops, best of 3: 1.51 s per loop
And of course:
In [42]: np.allclose(remove_outliers(arr, 1), remove_outliers_bis(arr, 1))
Out[42]: True
In [43]: np.allclose(remove_outliers(arr, 2), remove_outliers_bis(arr, 2))
Out[43]: True
I would say that the added complexity of the second method does not justify its potential speed-up, but YMMV...
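One way to see why the timings above flip between k = 1 and k = 2 is to look at how quickly the mask in the second method shrinks. The sketch below is only an illustration: the helper name surviving_fraction, the seed and the smaller sample size are my own choices, not part of the answer. It reports the fraction of rows still kept after each column has been tested.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sample = rng.standard_normal((200_000, 10))  # smaller than the array timed above

def surviving_fraction(arr, k):
    """Fraction of rows still kept after each successive column is tested."""
    mu, sigma = arr.mean(axis=0), arr.std(axis=0, ddof=1)
    inside = np.abs((arr - mu) / sigma) < k
    # Running AND across columns: a row survives column j only if it
    # passed the test in columns 0..j
    kept = np.logical_and.accumulate(inside, axis=1)
    return kept.mean(axis=0)

for k in (1, 2):
    print(k, np.round(surviving_fraction(sample, k), 3))
```

For normally distributed data, roughly 68% of the values in a column fall within one standard deviation, so with k = 1 only a few percent of the rows are left after ten columns and the later columns are tested on very little data. With k = 2 most rows survive every column, so the masked indexing is mostly overhead.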
Solution 2:
The best-performing solution depends on the relative cost of finding an outlier and of deleting a row, and on how frequent the outliers are.
If your outlier frequency is not very high, I would do as follows:
- create a boolean table of outliers (one element for each element in the original table)
- sum the table along axis 1 (the sum of each row)
- create a new table where there are only the rows where the outlier sum is 0
Deleting rows one by one takes a lot of time. If outlier-finding is not very expensive, the extra work of possibly finding several outliers in the same row is not significant.
In code this would be something like:
outliers = find_outliers(data)
data_without_outliers = data[outliers.sum(axis=1) == 0]
where find_outliers creates a boolean table of outlier status (i.e. True if the corresponding element in the original array data is an outlier).
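The answer leaves find_outliers unspecified. As a minimal sketch, assuming the same per-column z-score criterion as in Solution 1 (the threshold k=3.0, the seed and the injected value are assumptions made purely for illustration), it could look like this:

```python
import numpy as np

def find_outliers(data, k=3.0):
    """Boolean array, True where an element lies more than k sample
    standard deviations from the mean of its column (assumed criterion)."""
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)
    return np.abs(data - mu) > k * sigma

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.standard_normal((1000, 3))
data[5, 1] = 50.0  # inject one obvious outlier for the demonstration

outliers = find_outliers(data)
# Keep only the rows that contain no outlier in any column
data_without_outliers = data[outliers.sum(axis=1) == 0]
print(data.shape, data_without_outliers.shape)
```

Writing the row filter as data[~outliers.any(axis=1)] selects the same rows and avoids building an integer sum.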
My guess is that the performance depends on your outlier-detection algorithm. If you can make it simple and vectorized, then this is fast.
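To illustrate the earlier point about deleting rows one by one, here is a small sketch comparing the two ways of dropping rows. The function names and the raw threshold of 2 are hypothetical, chosen only so that there is something to delete. np.delete returns a new array on every call, so the loop copies the data once per removed row, while the boolean mask copies it once.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
data = rng.standard_normal((2_000, 4))
# Illustrative criterion only: rows containing any value with |x| > 2
bad_rows = np.flatnonzero((np.abs(data) > 2).any(axis=1))

def delete_one_by_one(arr, rows):
    # Each np.delete call allocates and fills a new array
    for r in sorted(rows, reverse=True):  # from the end, so earlier indices stay valid
        arr = np.delete(arr, r, axis=0)
    return arr

def delete_with_mask(arr, rows):
    keep = np.ones(arr.shape[0], dtype=bool)
    keep[rows] = False
    return arr[keep]  # a single copy of the kept rows

a = delete_one_by_one(data, bad_rows)
b = delete_with_mask(data, bad_rows)
print(np.array_equal(a, b), a.shape)
```

Both give the same result; timing them with %timeit on your own data should show how much the repeated copying costs.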