Python Double Free Error For Huge Datasets
Solution 1:
After discussions of the same issue on the Numpy GitHub page (https://github.com/numpy/numpy/issues/2995), it has been brought to my attention that Numpy/Scipy will not support such a large number of non-zeros in the resulting sparse matrix.
Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a dense matrix with a sparse one, the resulting product will also be represented in sparse matrix form (which makes a lot of sense). However, note that since I have no zero rows in my W matrix, the resulting product W*(np.log(Q)-1) will have nnz > 2^31 (2.2 million multiplied by 2000), and this exceeds the maximum number of elements in a sparse matrix in current versions of Scipy.
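As a quick sanity check (not part of the original code; the product shape below is an assumption based on the figures quoted above), you can estimate how many stored elements the product would need and compare that against the 32-bit index limit:

# Assumed product shape, taken from the numbers quoted above:
# roughly 2.2 million rows and 2000 columns.
n_rows = 2_200_000
n_cols = 2_000

# With no all-zero rows in W, essentially every entry of the product
# gets stored, so a sparse result would need this many non-zeros.
nnz_estimate = n_rows * n_cols

int32_limit = 2**31 - 1  # largest value a 32-bit signed index can hold
print("estimated nnz:", f"{nnz_estimate:,}")                        # 4,400,000,000
print("exceeds 32-bit index range:", nnz_estimate > int32_limit)    # True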
At this stage, I'm not sure how else to get this to work, barring a re-implementation in another language. Perhaps it can still be done in Python, but it might be better to just write up a C++ and Eigen implementation.
A special thanks to pv. for helping to pinpoint the exact issue, and thanks to everyone else for the brainstorming!
Solution 2:
Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a dense matrix with a sparse one, the resulting product will also be represented in sparse matrix form (which makes a lot of sense).
I'm probably missing something really obvious here, and will end up looking like an idiot, but…
If Q is a dense matrix, and you're hoping to store the result as a dense matrix, you probably have enough to hold W as a dense matrix as well. Which means:
W.todense()*(np.log(Q)-1)
Looking at the details, as you calculated in the comments, this would require 35.8GB of temporary memory. Given that you've got 131GB of data and this "fits comfortably into memory", it seems at least plausible that temporarily using another 35.8GB would be reasonable.
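For reference, a rough back-of-the-envelope version of that memory estimate (the shape here is an assumption based on the numbers above, so the result is only approximate):

# Assumed shape of the dense temporary: ~2.2 million x 2000 float64 values.
n_rows = 2_200_000
n_cols = 2_000
bytes_per_float64 = 8

temp_gb = n_rows * n_cols * bytes_per_float64 / 1e9
print(f"approximate temporary memory: {temp_gb:.1f} GB")  # roughly 35 GB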
If it's not reasonable, you can always decompose the matrix multiplication yourself. Obviously doing it row by row or column by column would make your entire process much slower (maybe not as much as pushing the process over the edge into swapping, but still maybe far too slow to be acceptable). But doing it in, e.g., a chunk of 1GB worth of rows at a time shouldn't be too bad. That would mean temporary storage on the order of a few GB, and probably only a small slowdown. Of course it's more complicated and ugly code, but not unmanageably so.
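Here's a minimal sketch of that chunked approach, assuming W is a scipy.sparse CSR matrix and Q is a dense NumPy array (the function name and chunk size are illustrative, not from the original code):

import numpy as np

def chunked_product(W, Q, chunk_rows=50_000):
    # Compute W * (np.log(Q) - 1) one block of rows at a time, so the
    # sparse-times-dense intermediate stays small.
    rhs = np.log(Q) - 1.0                       # dense right-hand side
    out = np.empty((W.shape[0], rhs.shape[1]))  # dense result, filled block by block
    for start in range(0, W.shape[0], chunk_rows):
        stop = min(start + chunk_rows, W.shape[0])
        # Row-slicing a CSR matrix is cheap; each per-chunk product only
        # involves chunk_rows x rhs.shape[1] entries, far below the 2^31
        # limit that breaks the full product.
        out[start:stop] = W[start:stop].dot(rhs)
    return out

Tuning chunk_rows trades a bit of Python-level overhead for a smaller per-chunk temporary; the full dense result itself is still the roughly 35GB discussed above.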