
Dot Product Of Huge Arrays In Numpy

I have a huge array and I want to calculate its dot product with a small array, but I am getting an 'array is too big' error. Is there a workaround? import numpy as np eMatrix = np.random.rand

Solution 1:

That error is raised while NumPy computes the total size of the array, if that size overflows the native int type.

For this to happen, even if your machine is 64-bit, you are almost certainly running 32-bit versions of Python (and NumPy). You can check whether that is the case with:

>>> import sys
>>> sys.maxsize
2147483647  # <--- 2**31 - 1; on a 64-bit version you would get 2**63 - 1
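The same check can be written as a small script. This is a minimal sketch: `interpreter_bits` is a hypothetical helper name, not something from the question, and it relies only on the standard-library `struct` and `sys` modules.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running Python build, in bits."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *).
    return struct.calcsize("P") * 8


bits = interpreter_bits()
# sys.maxsize is 2**31 - 1 on a 32-bit build and 2**63 - 1 on a 64-bit build.
is_64bit = sys.maxsize > 2**32
print(f"{bits}-bit Python, sys.maxsize = {sys.maxsize}")
```

Note that this reports the bitness of the Python interpreter, not of the operating system or the CPU.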

Then again, your array is "only" 20000000 * 50 = 1000000000 elements, which is just under 2**30. If I try to reproduce your results on a 32-bit NumPy, I get a MemoryError:

>>> np.random.random_integers(low=0,high=100,size=(20000000,50))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
MemoryError

unless I increase the number of elements beyond the magic 2**31 - 1 threshold:

>>> np.random.random_integers(low=0,high=100,size=(2**30, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
ValueError: array is too big.
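The two failure modes can be told apart with some arithmetic: the element count stays below the overflow threshold, but the number of bytes does not fit in a 32-bit address space (at most 2**32 bytes, usually less in practice). A rough sketch follows; `dense_nbytes` is a hypothetical helper, and the dtypes are assumptions, since the question's actual dtype is not shown.

```python
import numpy as np

rows, cols = 20_000_000, 50
n_elements = rows * cols  # element count of the array from the question


def dense_nbytes(n, dtype):
    """Bytes needed to hold n elements of the given dtype in a dense array."""
    return n * np.dtype(dtype).itemsize


print("elements:", n_elements, "| below 2**31 - 1:", n_elements < 2**31 - 1)
for dtype in (np.int32, np.int64, np.float64):
    nbytes = dense_nbytes(n_elements, dtype)
    # Compare against the 2**32 bytes a 32-bit process can address at most.
    print(np.dtype(dtype).name, nbytes, "bytes | fits in 2**32:", nbytes < 2**32)
```

So the element count alone does not trigger "array is too big", while the byte count is enough to explain a MemoryError on a 32-bit build.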

Given the difference in the line numbers in your traceback and mine, I suspect you are using an older NumPy version. What does this output on your system:

>>> np.__version__
'1.10.0.dev-9c50f98'

Solution 2:

I think the only "simple" answer is to get more RAM.

It took 15GB, but I was able to do this on my MacBook.

In [1]: import numpy
In [2]: e = numpy.random.random_integers(low=0, high=100, size=(20000000, 50))
In [3]: p = numpy.random.random_integers(low=0, high=10, size=(50, 50))
In [4]: a = numpy.dot(e, p)
In [5]: a[0]
Out[5]:
array([14753, 12720, 15324, 13588, 16667, 16055, 14144, 15239, 15166,
       14293, 16786, 12358, 14880, 13846, 11950, 13836, 13393, 14679,
       15292, 15472, 15734, 12095, 14264, 12242, 12684, 11596, 15987,
       15275, 13572, 14534, 16472, 14818, 13374, 14115, 13171, 11927,
       14226, 13312, 16070, 13524, 16591, 16533, 15466, 15440, 15595,
       13164, 14278, 13692, 12415, 13314])
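If more RAM is not an option, one alternative (not part of the original answer) is to compute the product one block of rows at a time, so that only a slice of e is needed in memory at once. The sketch below uses small stand-in sizes so it runs anywhere; `chunked_dot` and `chunk_rows` are hypothetical names chosen for illustration.

```python
import numpy as np


def chunked_dot(e, p, out, chunk_rows=100_000):
    """Fill out with the matrix product of e and p, one block of rows at a time."""
    n_rows = e.shape[0]
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Only this block of rows and its product are handled per iteration.
        out[start:stop] = np.dot(e[start:stop], p)
    return out


# Small stand-in sizes; the question's array would be (20000000, 50).
rng = np.random.default_rng(0)
e = rng.integers(0, 101, size=(1_000, 50))  # values in 0..100
p = rng.integers(0, 11, size=(50, 50))      # values in 0..10
out = np.empty((e.shape[0], p.shape[1]), dtype=np.result_type(e, p))
chunked_dot(e, p, out, chunk_rows=128)
```

For the full-size problem, both e and out would themselves be too large for RAM, so on a 64-bit build they could plausibly be file-backed arrays created with `np.memmap`. That part is an assumption about your setup and is not shown here.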

A possible solution might be using a sparse matrix and the sparse matrix dot operator.

For example, on my machine constructing just e as a dense matrix used 8GB of RAM. Constructing a similar sparse matrix eprime:

In [1]: from scipy.sparse import rand
In [2]: eprime = rand(20000000, 50)

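has negligible cost in terms of memory, because scipy.sparse.rand defaults to density=0.01 and only the nonzero entries are stored. This helps only if your real data is mostly zeros.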
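To make the sparse route concrete, here is a minimal sketch of a sparse-by-dense product. It assumes SciPy is installed and uses a small stand-in shape; `sparse_bytes` and `dense_bytes` are illustrative names.

```python
import numpy as np
from scipy import sparse

# Small stand-in shape; density=0.01 means about 1% of entries are nonzero.
eprime = sparse.random(20_000, 50, density=0.01, format="csr")
p = np.random.default_rng(0).integers(0, 11, size=(50, 50))

# Sparse matrix times dense matrix.
a = eprime.dot(p)

# Bytes held by the CSR representation versus an equivalent dense array.
sparse_bytes = eprime.data.nbytes + eprime.indices.nbytes + eprime.indptr.nbytes
dense_bytes = eprime.shape[0] * eprime.shape[1] * eprime.dtype.itemsize
print("sparse bytes:", sparse_bytes, "| dense bytes:", dense_bytes)
```

Keep in mind that the product with a dense p is itself dense, so the result still needs memory for every row and column even when the input is sparse.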

Solution 3:

I believe the answer is that you do not have enough RAM, and possibly also that you are running a 32-bit version of Python. It would help to clarify which OS you are running, since many OSes will run both 32-bit and 64-bit programs.
