Finding Nearest Items Across Two Lists/arrays In Python

August 10, 2022 Post a Comment

I have two numpy arrays x and y containing float values. For each value in x, I want to find the closest element in y, without reusing elements from y. The output should be a 1-1 m

Solution 1:

EDIT 2 A solution using KDTree can perform very well if you can choose a number of neighbors that guarantees that you will have a unique neighbor for every item in your array. With the following code:

def nearest_neighbors_kd_tree(x, y, k) :
    x, y = map(np.asarray, (x, y))
    tree =scipy.spatial.cKDTree(y[:, None])    
    ordered_neighbors = tree.query(x[:, None], k)[1]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    nearest_neighbor.fill(-1)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors) :
        for k in neigh_j :
            if k not in used_y :
                nearest_neighbor[j] = k
                used_y.add(k)
                break
    return nearest_neighbor

and a sample of n=1000 points, I get:

In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True

In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False

So the optimum is k=13, and then the timing is:

In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop

But in the worst case, you could need k=1000, and then:

In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop

Which is slower than the other options:

In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop

In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop

EDIT Sorting the array before searching pays off for arrays of more than 1000 items:

def nearest_neighbors_sorted(x, y) :
    x, y = map(np.asarray, (x, y))
    y_idx = np.argsort(y)
    y = y[y_idx]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x) :
        idx = np.searchsorted(y, xj)
        if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx-1] :
            idx -= 1
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

With a 10000 element long array:

Baca Juga

In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop

In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop

For smaller arrays it performs slightly worse.

You are going to have to loop over all your items to implement your greedy nearest neighbor algorithm, if only to discard duplicates. With that in mind, this is the fastest I have been able to come up with:

def nearest_neighbors(x, y) :
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x) :
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)

    return nearest_neighbor

And now with:

n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)

I get:

In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop

Python Guru

Finding Nearest Items Across Two Lists/arrays In Python

Solution 1:

Post a Comment for "Finding Nearest Items Across Two Lists/arrays In Python"