Keep All Elements In One List From Another
Solution 1:
Convert the list keep into a set, since it will be checked frequently. Iterate over train directly, since you want to keep order and repeats; that rules out turning train into a set. Even if it didn't, a set wouldn't help, since the iteration would have to happen anyway:
keeps = set(keep)
train_keep = [k for k in train if k in keeps]
A lazier, and probably slower, version would be something like:
train_keep = filter(lambda x: x in keeps, train)
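As a minimal sketch of the lazy version, using the sample lists that appear in Solution 3 below: in Python 3, filter returns an iterator, so it has to be wrapped in list() (or looped over) to produce results. The variant that passes keeps.__contains__ instead of a lambda is an illustration added here, not part of the original answer.

```python
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
keeps = set(keep)

# filter() returns a lazy iterator in Python 3; nothing is checked yet.
lazy = filter(lambda x: x in keeps, train)

# Passing the set's bound membership method avoids the lambda wrapper.
lazy_no_lambda = filter(keeps.__contains__, train)

train_keep = list(lazy)
print(train_keep)                          # [1, 3, 4, 3, 1]
print(list(lazy_no_lambda) == train_keep)  # True
```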
Neither of these options will give you a large speedup. You'd probably be better off using numpy or pandas, or some other library that implements the loops in C and stores numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:
import numpy as np

train = np.array([...])
keep = np.array([...])
train_keep = train[np.isin(train, keep)]
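To make the placeholder arrays concrete, here is a small runnable sketch of the same np.isin approach. The sample values are borrowed from Solution 3 and are only an assumption about what the data looks like.

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# Boolean mask: True wherever the train element also occurs in keep.
mask = np.isin(train, keep)

# Boolean indexing preserves both the order and the repeats of train.
train_keep = train[mask]
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```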
This is likely an O(M * N) algorithm rather than an O(M) set lookup, but if checking N elements in keep is faster than a nominally O(1) lookup, you win.
You can get something closer to O(M log(N)) using sorted lookup:
train = np.array([...])
keep = np.array([...])
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
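The sorted lookup can be traced on small inputs. This sketch assumes the sample values from Solution 3 (with keep deliberately unsorted) and uses np.sort to get a sorted copy rather than sorting in place.

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.sort(np.array([4, 1, 3]))  # sorted copy: [1, 3, 4]

# For each train element, find the leftmost insertion point in keep.
ind = np.searchsorted(keep, train, side='left')

# Elements larger than everything in keep get index keep.size, which is
# out of bounds; clamp those to the last valid index.
ind[ind == keep.size] -= 1

# An element is kept only if the candidate found in keep matches it exactly.
train_keep = train[keep[ind] == train]
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```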
A better alternative might be to append np.inf or a maximum out-of-bounds integer to the sorted keep array, so you don't have to distinguish missing from edge elements with an extra step at all. Something like max(train.max() + 1, keep.max()) would do:
train = np.array([...])
keep = np.array([... 99999])
keep.sort()
ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]
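A runnable sketch of the sentinel idea, again assuming the sample values from Solution 3. The names sentinel and keep_ext are hypothetical and introduced only for this illustration; the sentinel is computed rather than hard-coded.

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# A value at least as large as anything in train, so every lookup lands
# on a valid index. It is either absent from train or already in keep.
sentinel = max(train.max() + 1, keep.max())
keep_ext = np.sort(np.append(keep, sentinel))  # [1, 3, 4, 6]

ind = np.searchsorted(keep_ext, train, side='left')

# No clamping step is needed: ind is always smaller than keep_ext.size.
train_keep = train[keep_ext[ind] == train]
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```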
For random inputs with train.size = 10000 and keep.size = 10, the numpy method is ~10x faster on my laptop.
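One way to check a claim like this on your own machine is a small timeit comparison. This is a sketch under assumptions: the value range (0 to 99), the seed, and the repeat count are arbitrary choices made here, not the setup used for the figure above, so the ratio you see may differ.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)                # fixed seed, arbitrary
train = rng.integers(0, 100, size=10000)      # assumed value range
keep = rng.integers(0, 100, size=10)

train_list = train.tolist()
keep_set = set(keep.tolist())

def with_set():
    # Pure-Python list comprehension with set lookup.
    return [x for x in train_list if x in keep_set]

def with_isin():
    # Vectorized membership test in numpy.
    return train[np.isin(train, keep)]

# Both approaches should select the same elements in the same order.
same = with_set() == with_isin().tolist()

t_set = timeit.timeit(with_set, number=100)
t_isin = timeit.timeit(with_isin, number=100)
print(f"same result: {same}, set: {t_set:.4f}s, isin: {t_isin:.4f}s")
```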
Solution 2:
>>> keep_set = set(keep)
>>> [val for val in train if val in keep_set]
[1, 3, 4, 3, 1]
Note that if keep is small, there might not be any performance advantage to converting it to a set (benchmark to make sure).
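A benchmark along those lines could look like the following sketch. The list sizes and repeat count are arbitrary assumptions; the point is only to compare list membership against set membership for a very small keep.

```python
import timeit

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1] * 1000
keep = [1, 3, 4]
keep_set = set(keep)

def with_list():
    # Membership test scans the (tiny) list each time.
    return [val for val in train if val in keep]

def with_set():
    # Membership test hashes the value and probes the set.
    return [val for val in train if val in keep_set]

# The two versions must agree before their timings mean anything.
results_match = with_list() == with_set()

print("list:", timeit.timeit(with_list, number=100))
print("set: ", timeit.timeit(with_set, number=100))
```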
Solution 3:
This is an option:
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
keep_set = set(keep)
res = [item for item in train if item in keep_set]
# [1, 3, 4, 3, 1]
I use keep_set to speed up the lookup a bit.
Solution 4:
The logic is the same, but give it a try; a generator may be faster in your case:
def keep_if_in(to_keep, ary):
    for element in ary:
        if element in to_keep:
            yield element
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
train_keep = keep_if_in(set(keep), train)
Finally, convert to a list when required, or iterate over the generator directly:
print(list(train_keep))
# Alternatively, comment out the line above and uncomment the loop below.
# Do one or the other, because a generator can be consumed only once.
# for e in train_keep:
#     print(e)
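The "consumed only once" point can be seen directly in this short sketch, which repeats the generator function so that it runs on its own. The names first_pass and second_pass are introduced here for illustration.

```python
def keep_if_in(to_keep, ary):
    # Yield the elements of ary that are members of to_keep, in order.
    for element in ary:
        if element in to_keep:
            yield element

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

train_keep = keep_if_in(set(keep), train)

first_pass = list(train_keep)   # consumes the generator
second_pass = list(train_keep)  # nothing left to yield

print(first_pass)   # [1, 3, 4, 3, 1]
print(second_pass)  # []
```

If the result is needed more than once, store the list, or call keep_if_in again to get a fresh generator.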
Solution 5:
This is a slight expansion of Mad Physicist's clever technique, to cover a situation where the lists contain strings and one of them is a dataframe column. (I was trying to find a list of items in a dataframe, including all duplicates, but the obvious answer, mylist.isin(df['col']), removed the duplicates.) I adapted his answer to deal with the problem of possible truncation of character data by NumPy.
import numpy as np
import pandas as pd

# Sample dataframe with strings
d = {'train': ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510l','ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510d02','ABC_S8#Q09#2#510c#8y','ABC_S8#Q09#2#510a#6'], 'col2': [1,2,3,4,5,6]}
df = pd.DataFrame(data=d)
keep_list = ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510b13','ABC_S8#Q09#2#510c#8y']
# Make sure the NumPy datatype accommodates the longest string in either list
maxlen = max(len(max(keep_list, key = len)),len(max(df['train'], key = len)))
strtype = '<U'+ str(maxlen)
#Convert lists to Numpy arrays
keep = np.array(keep_list,dtype = strtype)
train = np.array(df['train'],dtype = strtype)
#Algorithm
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = df[keep[ind] == df['train']] #reference the original dataframe
I found this to be much faster than other solutions I tried.
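To illustrate the truncation concern mentioned above, here is a minimal sketch of how a fixed-width NumPy string dtype behaves. The '<U5' width is chosen here purely for demonstration and the variable names are hypothetical.

```python
import numpy as np

long_string = 'ABC_S8#Q09#2#510a#6'

# An explicit fixed-width dtype that is too narrow silently truncates.
too_narrow = np.array([long_string], dtype='<U5')
print(too_narrow[0])  # ABC_S

# Without an explicit dtype, NumPy sizes the array to the longest string.
inferred = np.array(['ab', 'abcd'])
print(inferred.dtype == np.dtype('U4'))  # True

# A width computed from the data, as in the solution above, keeps it whole.
wide_enough = np.array([long_string], dtype='<U' + str(len(long_string)))
print(wide_enough[0] == long_string)  # True
```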