How To Check If All Elements Of A Numpy Array Are In Another Numpy Array
Solution 1:
I hope I got your question right. At least it works with the problem you described in your question. If the order of the output should stay the same as the input, change the inplace-sort.
The code looks quite ugly, but should perform well and shouldn't be to hard to understand.
Code
import time
import numba as nb
import numpy as np
@nb.njit(fastmath=True,parallel=True)
def filter(A,B):
iFilter=np.zeros(A.shape[0],dtype=nb.bool_)
for i in nb.prange(A.shape[0]):
break_loop=False
for j in range(B.shape[0]):
ind_to_B=0
for k in range(A.shape[1]):
if A[i,k]==B[j,ind_to_B]:
ind_to_B+=1
if ind_to_B==B.shape[1]:
iFilter[i]=True
break_loop=True
break
if break_loop==True:
break
return A[iFilter,:]
Measuring performance
####First call has some compilation overhead####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)
t1=time.time()
#At first sort the arrays
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)
####Let's measure the second call too####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)
t1=time.time()
#At first sort the arrays
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)
Results
46ms after the first run on a dual-core Notebook (sorting included)
32ms (sorting excluded)
Solution 2:
I think this should work:
First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):
Make A shape [n_rows, n_tokens], dtype int8, where each element counts the number of tokens. Encode B in the same way, with shape [n_hands, n_tokens]
This allows for a single vectorized expression of your output; matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (exactly how to map this matches array to your desired output format is left as an excerise to the reader since the question leaves it undefined for multiple matches).
But we are talking > 10Mbyte of memory per token here. Even with 32 tokens that should not unthinkable; but in a situation like this it tends to be better to not vectorize the loop over n_tokens or n_hands, or both; for loops are fine for small n, or if there is sufficient work to be done in the body, such that the looping overhead is insignificant.
As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution, if staying within the realm of pure python and numpy.
Post a Comment for "How To Check If All Elements Of A Numpy Array Are In Another Numpy Array"