
How To Efficiently Split Scipy Sparse And Numpy Arrays Into Smaller N Unequal Chunks?

After checking the documentation and this question, I tried to split a NumPy array and a SciPy sparse matrix as follows: >>>print(X.shape) (2399, 39999) >>>pri

Solution 1:

Make a sparse matrix:

In [62]: M=(sparse.rand(10,3,.3,'csr')*10).astype(int)
In [63]: M
Out[63]: 
<10x3 sparse matrix of type '<class 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>
In [64]: M.A
Out[64]: 
array([[0, 7, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 5],
       [0, 0, 2],
       [0, 0, 6],
       [0, 4, 4],
       [7, 1, 0],
       [0, 0, 2]])

The dense equivalent is easily split. array_split handles unequal chunks, but you can also spell out the split as illustrated in the other answer.

In [65]: np.array_split(M.A, 3)
Out[65]: 
[array([[0, 7, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]), array([[0, 0, 5],
        [0, 0, 2],
        [0, 0, 6]]), array([[0, 4, 4],
        [7, 1, 0],
        [0, 0, 2]])]
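
To make the "unequal chunks" point concrete, here is a minimal sketch on illustrative data (the 10x3 array below is a stand-in, not the matrix from the question). It shows that np.array_split accepts a row count that does not divide evenly, while np.split with an integer section count does not:

```python
import numpy as np

A = np.arange(30).reshape(10, 3)   # illustrative 10x3 dense array

# array_split tolerates a row count that is not divisible by 3;
# the leading chunks receive the extra rows
chunks = np.array_split(A, 3)
sizes = [c.shape[0] for c in chunks]
print(sizes)   # [4, 3, 3]

# split with an integer requires an exact division and raises otherwise
try:
    np.split(A, 3)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)   # True
```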

In general, numpy functions cannot work directly on sparse matrices, because a sparse matrix is not a subclass of np.ndarray. Unless the function delegates the action to the object's own method, it probably won't work. Often the function starts with np.asarray(M), which is not the same as M.toarray() (try it yourself).
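
For anyone who wants to try the np.asarray versus toarray comparison, a small sketch follows. The 3x3 matrix is made up for illustration, and the exact result of np.asarray on a sparse object may vary between versions; what I have seen is an object-dtype wrapper rather than a numeric 2-D array:

```python
import numpy as np
from scipy import sparse

# small illustrative matrix (not the one from the session above)
M = sparse.csr_matrix(np.array([[0, 7, 0],
                                [0, 0, 5],
                                [7, 1, 0]]))

wrapped = np.asarray(M)   # wraps the sparse object itself, no expansion
dense = M.toarray()       # expands to a regular 3x3 ndarray

print(wrapped.dtype, wrapped.shape)
print(dense.dtype, dense.shape)
```

This is why a function that begins with np.asarray(M) usually ends up operating on something that is not a numeric array at all.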

But split is nothing more than slicing along the desired axis. I can produce the same 4,3,3 split with:

In [143]: alist = [M[0:4,:], M[4:7,:], M[7:10]]
In [144]: alist
Out[144]: 
[<4x3 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in Compressed Sparse Row format>,
 <3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>]
In [145]: [m.A for m in alist]
Out[145]: 
[array([[0, 7, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]], dtype=int32), array([[0, 0, 5],
        [0, 0, 2],
        [0, 0, 6]], dtype=int32), array([[0, 4, 4],
        [7, 1, 0],
        [0, 0, 2]], dtype=int32)]

The rest is administrative details.

I should add that sparse slices are never views. They are new sparse matrices with their own data attribute.
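
The view-versus-copy difference can be checked with a short sketch. The arrays here are illustrative, and the check relies on CSR row slicing as I understand it to behave in current SciPy releases:

```python
import numpy as np
from scipy import sparse

original = np.arange(12).reshape(4, 3)
dense = original.copy()
M = sparse.csr_matrix(original)

# NumPy basic slicing returns a view: writing to it changes `dense`
view = dense[0:2]
view[0, 0] = 99
print(dense[0, 0])                              # 99

# Sparse slicing builds a new matrix: overwriting its stored values leaves M alone
chunk = M[0:2]
chunk.data[:] = -1
print(np.array_equal(M.toarray(), original))    # True
print(np.shares_memory(chunk.data, M.data))     # False
```

So each chunk costs extra memory for its own data, indices and indptr arrays, but it can be modified without touching the original matrix.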


With the split indexes in a list, we can construct the split list with a simple iteration:

In [146]: idx = [0,4,7,10]
In [149]: alist = []
In [150]: for i in range(len(idx)-1):
     ...:     alist.append(M[idx[i]:idx[i+1]])   

I haven't worked out the details of how to construct idx, though an obvious starting point is the 10, i.e. M.shape[0].
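
One possible way to build idx, assuming the desired chunk sizes are known up front, is a cumulative sum over those sizes. The helper name split_rows is hypothetical and this is only a sketch, not a SciPy function:

```python
import numpy as np
from scipy import sparse

def split_rows(M, sizes):
    """Split M row-wise into chunks with the given row counts (hypothetical helper)."""
    if sum(sizes) != M.shape[0]:
        raise ValueError("chunk sizes must add up to the number of rows")
    # boundaries: [0, s0, s0+s1, ..., n_rows]
    idx = [0] + np.cumsum(sizes).tolist()
    return [M[idx[i]:idx[i + 1]] for i in range(len(idx) - 1)]

M = sparse.csr_matrix(np.arange(30).reshape(10, 3))   # illustrative data
parts = split_rows(M, [4, 3, 3])
print([p.shape for p in parts])   # [(4, 3), (3, 3), (3, 3)]
```

Stacking the parts back with sparse.vstack(parts) should reproduce the original matrix, which is a handy sanity check.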

For even splits (that fit):

In [160]: [M[i:i+5,:] for i in range(0,M.shape[0],5)]
Out[160]: 
[<5x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>,
 <5x3 sparse matrix of type '<class 'numpy.int32'>'
    with 7 stored elements in Compressed Sparse Row format>]
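
When the row count does not divide evenly, the chunk sizes can be worked out the same way np.array_split does it: the first rows % n chunks get one extra row. A sketch of that idea for sparse matrices, with sparse_array_split as a hypothetical name:

```python
import numpy as np
from scipy import sparse

def sparse_array_split(M, n):
    """Row-wise split into n chunks, sized like np.array_split (hypothetical helper)."""
    rows = M.shape[0]
    base, extra = divmod(rows, n)
    # the first `extra` chunks get one additional row
    sizes = [base + 1] * extra + [base] * (n - extra)
    idx = [0] + np.cumsum(sizes).tolist()
    return [M[idx[i]:idx[i + 1]] for i in range(n)]

M = sparse.csr_matrix(np.arange(30).reshape(10, 3))   # illustrative data
parts = sparse_array_split(M, 3)
print([p.shape[0] for p in parts])   # [4, 3, 3]
```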

Solution 2:

First, convert the scipy.sparse.csr_matrix to a numpy ndarray, then pass a list of split indices to numpy.split(ary, indices_or_sections, axis=0).

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in ary[:2], ary[2:3], ary[3:].

https://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html

X1,X2,X3=np.split(X.toarray(), [1000,2000])
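
To see what the index list does, here is a small sketch with a 6x4 stand-in for X and the boundaries [2, 3] from the quoted documentation (both are assumptions for illustration). It also applies the same boundaries to the sparse matrix directly, which avoids materializing the dense array when X is large:

```python
import numpy as np
from scipy import sparse

# small stand-in for X; the real X in the question is 2399 x 39999
X = sparse.csr_matrix(np.arange(24).reshape(6, 4))

X1, X2, X3 = np.split(X.toarray(), [2, 3])   # ary[:2], ary[2:3], ary[3:]
print(X1.shape, X2.shape, X3.shape)          # (2, 4) (1, 4) (3, 4)

# the same boundaries applied to the sparse matrix, without densifying
S1, S2, S3 = X[:2], X[2:3], X[3:]
print(S1.shape, S2.shape, S3.shape)          # (2, 4) (1, 4) (3, 4)
```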
