Python MemoryError When Fitting With scikit-learn
Solution 1:
Take a look at this part of your stack trace:
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
The algorithm you're using relies on numpy's linear algebra routines to do SVD. Those routines can't handle sparse matrices, so the author simply converts them to regular non-sparse arrays. The first step of that conversion is to allocate an all-zero array and then fill in the appropriate spots with the values stored sparsely in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you don't know what you're using) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:
183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
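A quick sanity check of that arithmetic (shapes taken from the question):
n_samples, n_features = 183576, 101507
total_bytes = n_samples * n_features * 8   # 8 bytes per float64 element
print(f"{total_bytes:,} bytes ~= {total_bytes / 1e9:.0f} GB")
# 149,073,992,256 bytes ~= 149 GB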
Your system's memory manager probably took one look at that allocation request and committed suicide. But what can you do about it?
First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
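If you go the dimensionality-reduction route, one option (my suggestion, not something from the question) is scikit-learn's TruncatedSVD, which accepts sparse input directly and never builds the dense copy. The shape below mirrors the question, but the density and component count are made-up:
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse matrix with the shape from the question.
X = sparse.random(183576, 101507, density=1e-4, format='csr', random_state=0)

# Project onto 100 dense components; fit_transform works on sparse input,
# so the ~150 GB dense array is never allocated.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)   # shape: (183576, 100)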
Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices: even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only take more space than the dense equivalent, so this approach might not work. Proceed with caution.
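For reference, here is a minimal sketch of what such a sparse-aware replacement could look like. The matrix and k are made-up; the point is only the svds call, which never densifies its input:
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Hypothetical sparse matrix.
X = sparse.random(2000, 500, density=0.01, format='csr', random_state=0)

# Compute only the top-10 singular triplets.
U, s, Vt = svds(X, k=10)

# svds returns singular values in ascending order; reverse to match
# the descending convention of np.linalg.svd.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]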
Solution 2:
The relevant option here is gcv_mode. It can take three values: "auto", "svd", and "eigen". By default, it is set to "auto", which has the following behavior: use the svd mode if n_samples > n_features, otherwise use the eigen mode.
Since in your case n_samples > n_features, the svd mode is chosen. However, the svd mode currently doesn't handle sparse data properly. scikit-learn should be fixed to use proper sparse SVD instead of the dense SVD.
As a workaround, I would force the eigen mode by passing gcv_mode="eigen", since this mode should handle sparse data properly. However, n_samples is quite large in your case, and since the eigen mode builds a kernel matrix (and thus has n_samples ** 2 memory complexity), the kernel matrix may not fit in memory. In that case, I would just reduce the number of samples (the eigen mode can handle a very large number of features without problem, though).
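A minimal sketch of that workaround (the data shapes and alpha grid here are made-up; the gcv_mode parameter itself is real):
import numpy as np
from scipy import sparse
from sklearn.linear_model import RidgeCV

# Hypothetical sparse data: eigen mode builds an (n_samples, n_samples)
# kernel matrix, so keep n_samples modest even when n_features is huge.
X = sparse.random(5000, 100000, density=1e-4, format='csr', random_state=0)
y = np.random.default_rng(0).standard_normal(5000)

model = RidgeCV(alphas=[0.1, 1.0, 10.0], gcv_mode="eigen")
model.fit(X, y)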
In any case, since both n_samples and n_features are quite large, you are pushing this implementation to its limits (even with a proper sparse SVD).
Also see https://github.com/scikit-learn/scikit-learn/issues/1921