
Cosine Similarity On Large Sparse Matrix With Numpy

The code below causes my system to run out of memory before it completes. Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as one with 65,000 rows?

Solution 1:

Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes for whatever reason, probably because it copies the matrix one time too many somewhere. So I made it compare small batches of rows "on the left" instead of the entire matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    ret = np.ndarray((m1.shape[0], m2.shape[0]))
    for row_i in range(0, int(m1.shape[0] / batch_size) + 1):
        start = row_i * batch_size
        end = min([(row_i + 1) * batch_size, m1.shape[0]])
        if end <= start:
            break  # because I'm too lazy to elegantly handle edge cases
        rows = m1[start: end]
        sim = cosine_similarity(rows, m2) # rows is O(1) size
        ret[start: end] = sim
    return ret

No crashes for me; YMMV. Try different batch sizes to make it faster. I used to only compare 1 row at a time, and it took about 30X as long on my machine.
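If you want to measure the effect of batch_size yourself, here is a minimal timing sketch (assuming the cosine_similarity_n_space function above is already defined; the matrix sizes are arbitrary):

import time
import numpy as np

m1 = np.random.rand(5000, 300)
m2 = np.random.rand(2000, 300)

for batch_size in (1, 10, 100, 1000):
    t0 = time.perf_counter()
    cosine_similarity_n_space(m1, m2, batch_size=batch_size)
    print(f"batch_size={batch_size}: {time.perf_counter() - t0:.2f}s")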

Stupid yet effective sanity check:

import random
while True:
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))

Solution 2:

I would run it in chunks like this:

from sklearn.metrics.pairwise import cosine_similarity

# Change chunk_size to control resource consumption and speed
# Higher chunk_size means more memory/RAM needed but also faster
chunk_size = 500
matrix_len = your_matrix.shape[0]  # Not sparse numpy.ndarray

def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    return cosine_similarity(X=your_matrix[start:end], Y=your_matrix)  # scikit-learn function

for chunk_start in range(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # Handle cosine_similarity_chunk (e.g. write it to a timestamped file and close the file)
    # Do not open the same file again or you may run out of memory after a few chunks
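To make the "handle the chunk" step concrete, here is one possible sketch (not part of the original answer; the file names are made up) that saves each chunk to its own .npy file so no chunk stays in memory longer than necessary:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

chunk_size = 500
matrix_len = your_matrix.shape[0]  # your_matrix: dense numpy.ndarray, as above

for chunk_start in range(0, matrix_len, chunk_size):
    chunk_end = min(chunk_start + chunk_size, matrix_len)
    # Similarity of this block of rows against the full matrix
    chunk = cosine_similarity(X=your_matrix[chunk_start:chunk_end], Y=your_matrix)
    # Write the chunk to its own file and release it before the next iteration
    np.save(f"cosine_similarity_{chunk_start}_{chunk_end}.npy", chunk)
    del chunk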

Solution 3:

You're running out of memory because you're trying to store a 65000x65000 matrix. Note that the matrix you're constructing is not sparse at all. np.random.rand generates random numbers between 0 and 1, so there aren't enough zeros for csr_matrix to actually compress your data. In fact, there are almost surely no zeros at all.
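You can check this directly: wrapping a matrix filled by np.random.rand in csr_matrix stores every single element, so the "sparse" version actually uses more memory than the dense array (a quick sketch on a small example):

import numpy as np
from scipy import sparse

dense = np.random.rand(1000, 10)   # values in (0, 1), essentially never exactly zero
csr = sparse.csr_matrix(dense)

print(csr.nnz, dense.size)         # nnz == size: nothing was compressed
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes, dense.nbytes)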

If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use the sparse dot product if possible:

MemoryError                               Traceback (most recent call last)
    887         Y_normalized = normalize(Y, copy=True)
    888
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890
    891     return K

So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (with 1% sparsity, for example) like this:

>>> import numpy as np
>>> from scipy import sparse
>>> a = np.zeros((65000, 10))
>>> i = np.random.rand(a.size)
>>> a.flat[i < 0.01] = 1  # Select ~1% of indices and set them to 1
>>> a = sparse.csr_matrix(a)

Then, on a machine with 32GB RAM (8GB RAM was not enough for me), the following runs with no memory error:

>>> b = cosine_similarity(a)
>>> b
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
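As an aside (not from the original answer), if you want to skip the temporary dense 65000x10 array entirely, scipy can generate a random sparse matrix of the same shape and density directly; a minimal sketch using scipy.sparse.random (its values are uniform random rather than exactly 1, but the density matches the example above):

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# 65000 x 10 CSR matrix with ~1% of entries filled, never materialized densely
a = sparse.random(65000, 10, density=0.01, format='csr')

# The result is still a dense 65000 x 65000 array, so plenty of RAM is needed here
b = cosine_similarity(a)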
