Cosine Similarity Yields 'nan' Values

December 21, 2023

I was calculating a Cosine Similarity Matrix for sparse vectors, and the elements expected to be float numbers appeared to be 'nan'. 'visits' is a sparse matrix showing how many ti

Solution 1:

Try:

def norm(x):
    return np.sqrt((x.T*x).A)

I constructed a smaller sample visits matrix and calculated cosine_distance_matrix with your code. Mine had a diagonal of 1s and lots of nan off the diagonal. I chose one of the nan items and looked at the corresponding i,k calculation.

In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])
In [691]: normi_normk
Out[691]: 
<1x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Column format>
In [692]: normi_normk.A
Out[692]: array([[ 18707.57953344]])

visits is a sparse matrix, so visits[:,i] is also a sparse matrix (1 column). Your norm function returns a 1x1 sparse matrix.

For this pair the dot product is 0, but it is still a 1x1 sparse matrix:

In [718]: visits[:,i].T.dot(visits[:, k])
Out[718]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 0 stored elements in Compressed Sparse Column format>

Dividing one of these 1x1 sparse matrices by the other produces nan:

In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
Out[717]: matrix([[ nan]])

But if I change normi_normk to a scalar or dense array, I get 0:

In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
Out[722]: matrix([[ 0.]])

So we have to change this from a sparse matrix/matrix division to something involving dense arrays or scalars. It can be done in various ways; rewriting norm to handle sparse matrices correctly is one.

In addition I'd suggest using:

(visits[:,i].T*visits[:, k]).A/normi_normk

so that both terms of the division are dense.

Another possibility is to use visits[:,i].A and visits[:,k].A, so the inner loop calculations are done with dense arrays rather than these matrices.
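To make the idea concrete, here is a minimal sketch of the loop with every division done on plain scalars. It assumes visits is a scipy.sparse.csc_matrix; the function name cosine_matrix_loop and the zero-norm guard are my own additions, not part of the original code, and .toarray() is used as the spelled-out form of .A.

```python
import numpy as np
from scipy import sparse


def norm(x):
    # x is a single sparse column; x.T @ x is a 1x1 sparse matrix,
    # so convert it to dense and pull out the scalar before the sqrt.
    return np.sqrt((x.T @ x).toarray()[0, 0])


def cosine_matrix_loop(visits):
    # Hypothetical helper: pairwise cosine similarity of the columns.
    n = visits.shape[1]
    result = np.zeros((n, n))
    for i in range(n):
        for k in range(n):
            # Dense scalar dot product instead of a 1x1 sparse matrix.
            dot = (visits[:, i].T @ visits[:, k]).toarray()[0, 0]
            denom = norm(visits[:, i]) * norm(visits[:, k])
            # Assumption: an all-zero column gets similarity 0 rather than nan.
            result[i, k] = dot / denom if denom > 0 else 0.0
    return result
```

Since both the numerator and the denominator are ordinary numbers here, a zero dot product divides to 0.0 instead of going through the sparse division.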

Note that I'm not doing anything advanced or special. I just examined in detail one of the problem calculations, and found the source of the nan.

I would also suggest using np.zeros to initialize the array. I only use np.ndarray directly when the usual zeros, ones, or empty don't work.

cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))

In the big picture it would be best to avoid looping over i and k and do everything with matrix products instead. But this fix will get you going.
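For reference, a loop-free version might look like the sketch below. It is only one way to do it, under the assumptions that visits fits in memory as a dense (columns x columns) result and that an all-zero column should get similarity 0; the name cosine_similarity_columns is hypothetical.

```python
import numpy as np
from scipy import sparse


def cosine_similarity_columns(visits):
    # Work in float so the norms and the division are not integer-typed.
    visits = sparse.csc_matrix(visits, dtype=np.float64)
    # One sparse product gives every column-pair dot product at once.
    gram = (visits.T @ visits).toarray()
    # The diagonal holds each column's squared norm.
    norms = np.sqrt(np.diag(gram))
    denom = np.outer(norms, norms)
    # Divide only where the denominator is nonzero; other entries stay 0.
    sim = np.zeros_like(gram)
    np.divide(gram, denom, out=sim, where=denom > 0)
    return sim
```

The division happens on dense NumPy arrays, so the sparse/sparse division that produced the nan values never occurs.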
