Skip to content Skip to sidebar Skip to footer

Storing The Result Of Minhash

The result is a fixed number of arrays, let's say lists (all of the same length) in python. One could see it as a matrix too, so in c I would use an array, where every cell would p

Solution 1:

Whatever container you choose, it should contain hash-itemID pairs, and should be indexed or sorted by the hash. Unsorted arrays will not be remotely efficient.

Assuming you're using a decent sized hash and your various hash algorithms are well-implemented, you should be able to just as effectively store all minhashes in a single container, since the chance of collision between a minhash from one algorithm and a minhash from another is negligible, and if any such collision occurs it won't substantially alter the similarity measure.

Using a single container as opposed to multiple reduces the memory overhead for indexing, though it also slightly increases the amount of processing required. As memory is usually the limiting factor for minhash, a single container may be preferable.

Solution 2:

You can store anything you want in a python list: ints, strings, more lists lists, dicts, objects, functions - you name it.

anything_goes_in_here = [1, 'one', lambda one: one / 1, {1: 'one'}, [1, 1]]

So storing a list of lists is pretty straight forward:

>>>list_1 = [1, 2, 3, 4]>>>list_2 = [5, 6, 7, 8]>>>list_3 = [9, 10, 11, 12]>>>list_4 = [13, 14, 15, 16]>>>main_list = [list_1, list_2, list_3, list_4]>>>forlistin main_list:...for num inlist:...print num...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

If you are looking to store a list of lists where the index is meaningful (meaning the index gives you some information about the data stored there), then this is basically reimplementing a hashmap (dictionary), and while you say it's trivial - using a dictionary sounds like it fits the problem well here.

Post a Comment for "Storing The Result Of Minhash"