Cosine Similarity Yields 'nan' Values
Solution 1:
Try:
import numpy as np

def norm(x):
    return np.sqrt((x.T * x).A)
I constructed a smaller sample visits matrix and calculated cosine_distance_matrix with your code. Mine had the diagonal of 1s and lots of nan on the off-diagonal. I chose one of the nan items and looked at the corresponding i,k calculation.
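The exact sample doesn't matter; any small, mostly-zero count matrix in which some columns share no nonzero rows should reproduce the same pattern. For example (the shape, values and sparsity here are arbitrary, chosen only for illustration):

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# a small user-by-item count matrix, mostly zeros, so some pairs of
# columns have no users in common (those pairs are the nan candidates)
counts = rng.integers(1, 5, size=(20, 6)) * (rng.random((20, 6)) < 0.25)
visits = sparse.csc_matrix(counts)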
In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])
In [691]: normi_normk
Out[691]:
<1x1 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Column format>
In [692]: normi_normk.A
Out[692]: array([[ 18707.57953344]])
visits is a sparse matrix, so visits[:,i] is also a sparse matrix (1 column). Your norm function returns a 1x1 sparse matrix. For this pair the dot product is 0, but it is still a 1x1 sparse matrix:
In [718]: visits[:,i].T.dot(visits[:, k])
Out[718]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
	with 0 stored elements in Compressed Sparse Column format>
The division of these sparse matrices produces nan:
In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
Out[717]: matrix([[ nan]])
But if I change normi_normk to a scalar or dense array I get 0:
In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
Out[722]: matrix([[ 0.]])
So we have to change this from a matrix/matrix division to something involving dense arrays or scalars. It can be changed in various ways; rewriting norm to handle sparse matrices correctly is one. In addition I'd suggest using:

(visits[:,i].T*visits[:, k]).A/normi_normk

so that both terms of the division are dense. Another possibility is to use visits[:,i].A and visits[:,k].A, so that the inner-loop calculations are done with dense arrays rather than these matrices.
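For concreteness, here is a minimal sketch of what the corrected inner loop might look like. It assumes visits is a sparse matrix like the one above and keeps the variable names used so far; your actual loop may be organised differently:

import numpy as np

def norm(x):
    # .A makes the result a dense 1x1 array instead of a sparse matrix
    return np.sqrt((x.T * x).A)

n = visits.shape[1]
cosine_distance_matrix = np.zeros((n, n))
for i in range(n):
    for k in range(n):
        normi_normk = norm(visits[:, i]) * norm(visits[:, k])  # dense 1x1 array
        num = (visits[:, i].T * visits[:, k]).A                 # dense 1x1 array
        # dense/dense division gives 0. for disjoint columns instead of nan
        # (an all-zero column would still divide by zero; guard for that if needed)
        cosine_distance_matrix[i, k] = (num / normi_normk)[0, 0]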
Note that I'm not doing anything advanced or special. I just examined one of the problem calculations in detail and found the source of the nan.
I would also suggest using np.zeros to initialize the array. I only use ndarray when the normal zeros, ones and empty don't work:
cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))
In the big picture it would be best to avoid looping over i and k, doing everything with matrix products and such; a rough sketch of that is below. But this fix will get you going.
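A sketch of the matrix-product approach, assuming visits is a sparse matrix as above and that column-versus-column cosine values are what you want:

import numpy as np

# all pairwise column dot products at once, as a dense array
gram = (visits.T * visits).A.astype(float)
# column norms are the square roots of the gram matrix's diagonal
norms = np.sqrt(gram.diagonal())
# (columns that are entirely zero would need special handling here)
cosine_distance_matrix = gram / np.outer(norms, norms)

If scikit-learn is available, sklearn.metrics.pairwise.cosine_similarity(visits.T) computes the same matrix and accepts sparse input directly.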