Finding Nearest Items Across Two Lists/arrays In Python
Solution 1:
EDIT 2 A solution using KDTree
can perform very well if you can choose a number of neighbors that guarantees that you will have a unique neighbor for every item in your array. With the following code:
def nearest_neighbors_kd_tree(x, y, k) :
x, y = map(np.asarray, (x, y))
tree =scipy.spatial.cKDTree(y[:, None])
ordered_neighbors = tree.query(x[:, None], k)[1]
nearest_neighbor = np.empty((len(x),), dtype=np.intp)
nearest_neighbor.fill(-1)
used_y = set()
for j, neigh_j in enumerate(ordered_neighbors) :
for k in neigh_j :
if k not in used_y :
nearest_neighbor[j] = k
used_y.add(k)
break
return nearest_neighbor
and a sample of n=1000
points, I get:
In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True
In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False
So the optimum is k=13
, and then the timing is:
In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop
But in the worst case, you could need k=1000
, and then:
In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop
Which is slower than the other options:
In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop
In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop
EDIT Sorting the array before searching pays off for arrays of more than 1000 items:
def nearest_neighbors_sorted(x, y) :
x, y = map(np.asarray, (x, y))
y_idx = np.argsort(y)
y = y[y_idx]
nearest_neighbor = np.empty((len(x),), dtype=np.intp)
for j, xj in enumerate(x) :
idx = np.searchsorted(y, xj)
if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx-1] :
idx -= 1
nearest_neighbor[j] = y_idx[idx]
y = np.delete(y, idx)
y_idx = np.delete(y_idx, idx)
return nearest_neighbor
With a 10000 element long array:
In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop
In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop
For smaller arrays it performs slightly worse.
You are going to have to loop over all your items to implement your greedy nearest neighbor algorithm, if only to discard duplicates. With that in mind, this is the fastest I have been able to come up with:
def nearest_neighbors(x, y) :
x, y = map(np.asarray, (x, y))
y = y.copy()
y_idx = np.arange(len(y))
nearest_neighbor = np.empty((len(x),), dtype=np.intp)
for j, xj in enumerate(x) :
idx = np.argmin(np.abs(y - xj))
nearest_neighbor[j] = y_idx[idx]
y = np.delete(y, idx)
y_idx = np.delete(y_idx, idx)
return nearest_neighbor
And now with:
n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)
I get:
In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop
Post a Comment for "Finding Nearest Items Across Two Lists/arrays In Python"