Scipy.sparse.csr_matrix: How to get the top ten values and indices?

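I don't see what the advantages of the csr format are in this case. Sure, all the nonzero values are collected in one .data array, with the corresponding column indices in .indices. But they sit in blocks of varying length, which means they can't be processed in parallel or with numpy array strides.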

One solution is to pad those blocks into blocks of a common length. That is what .toarray() does. Then you can find the maximum values with argsort(axis=1) or with argpartition.
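A minimal sketch of the dense route, to make the padding idea concrete. The helper name topk_dense and the small test matrix are my own illustration, not part of the question. Note that the padding zeros take part in the ranking, so a row with fewer than k stored values returns zeros with arbitrary column indices.

```python
import numpy as np
from scipy import sparse

def topk_dense(A, k):
    """Top-k values and column indices per row, via a dense intermediary (k >= 1)."""
    dense = A.toarray()
    k = min(k, dense.shape[1])
    # argpartition leaves the k largest entries of each row in the last k slots (unordered)
    part = np.argpartition(dense, -k, axis=1)[:, -k:]
    vals = np.take_along_axis(dense, part, axis=1)
    # order those k candidates from largest to smallest
    order = np.argsort(-vals, axis=1)
    idx = np.take_along_axis(part, order, axis=1)
    return np.take_along_axis(dense, idx, axis=1), idx

# hypothetical example matrix
A = sparse.csr_matrix(np.array([[0, 5, 0, 2, 9],
                                [1, 0, 0, 0, 3],
                                [0, 0, 7, 4, 0]], dtype=float))
vals, idx = topk_dense(A, 2)
```

This is only practical when the dense array fits in memory; for a matrix with tens of thousands of columns and only a few rows it may still be acceptable.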

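Another is to break them into row-sized blocks and process each of those. That is what you are doing with .getrow. Another way of breaking them up is to convert to lil format and process the sublists of the .data and .rows arrays.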
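The lil variant could look something like this. It is a sketch under my own assumptions: topk_lil is a hypothetical helper name, and it only ranks the stored (nonzero) values, so a row with fewer than k entries simply returns fewer results.

```python
import numpy as np
from scipy import sparse

def topk_lil(A, k):
    """Up to k largest stored values per row, with their column indices."""
    L = A.tolil()
    result = []
    # L.data and L.rows are object arrays holding one Python list per row
    for vals, cols in zip(L.data, L.rows):
        vals = np.asarray(vals, dtype=float)
        cols = np.asarray(cols, dtype=int)
        order = np.argsort(vals)[::-1][:k]  # descending, at most k entries
        result.append((vals[order], cols[order]))
    return result

# hypothetical example matrix; the last row is empty on purpose
A = sparse.csr_matrix(np.array([[0, 5, 0, 2, 9],
                                [1, 0, 0, 0, 3],
                                [0, 0, 0, 0, 0]], dtype=float))
top = topk_lil(A, 3)
```

The loop is plain Python, so the cost grows with the number of rows rather than with the number of columns.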

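A possible third option is to use the ufunc reduceat method. This lets you apply ufunc reduction methods to sequential blocks of an array. Established ufuncs such as np.add take advantage of this. argsort is not such a function. But there is a way of constructing a ufunc from a Python function, and of gaining some modest speed over regular Python iteration. [I need to look up a recent SO question that illustrates this.]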

I'll illustrate some of this with a simpler function, sum over rows. Assume A2 is a csr matrix.

    A2.sum(axis=1)  # the fastest, compiled csr method
    A2.toarray().sum(axis=1)  # same, but with a dense intermediary
    [np.sum(l.data) for l in A2]  # iterate over the rows of A2
    [np.sum(A2.getrow(i).data) for i in range(A2.shape[0])]  # iterate with index
    [np.sum(l) for l in A2.tolil().data]  # sum the sublists of lil format
    np.add.reduceat(A2.data, A2.indptr[:-1])  # with reduceat
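To check that these variants agree, here is a small self-contained sketch. The 3x4 matrix is made up for illustration and is not the sample matrix timed in this answer. One assumption to keep in mind: the reduceat version is only right when every row has at least one stored value, because reduceat does not return 0 for an empty block.

```python
import numpy as np
from scipy import sparse

# hypothetical small matrix; every row has at least one stored value
A2 = sparse.csr_matrix(np.array([[1., 0., 2., 0.],
                                 [0., 3., 0., 0.],
                                 [4., 0., 5., 6.]]))

s_sparse = np.asarray(A2.sum(axis=1)).ravel()         # compiled csr method
s_dense = A2.toarray().sum(axis=1)                    # dense intermediary
s_getrow = np.array([A2.getrow(i).data.sum()
                     for i in range(A2.shape[0])])    # row by row
s_lil = np.array([np.sum(l) for l in A2.tolil().data])  # lil sublists
s_reduceat = np.add.reduceat(A2.data, A2.indptr[:-1])   # block reduction
```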

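A2.sum(axis=1) is implemented as a matrix multiplication. That is not relevant to the sort problem, but it is still an interesting way of looking at the summation problem. Remember that the csr format was developed for efficient multiplication.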
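The idea can be reproduced by hand: multiplying by a vector of ones adds up each row. This is my own illustration of the equivalence on a made-up matrix, not a copy of SciPy's internal code.

```python
import numpy as np
from scipy import sparse

# hypothetical small matrix
A2 = sparse.csr_matrix(np.array([[1., 0., 2., 0.],
                                 [0., 3., 0., 0.],
                                 [4., 0., 5., 6.]]))

ones = np.ones(A2.shape[1])
row_sums = A2 @ ones                       # sparse matrix times dense vector
builtin = np.asarray(A2.sum(axis=1)).ravel()
```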

For my current sample matrix (created for another SO sparse question)

    <8x47752 sparse matrix of type '<class 'numpy.float32'>'
         with 32 stored elements in Compressed Sparse Row format>

here is the row sum implemented with np.frompyfunc:

    In [741]: def foo(a,b):
         ...:     return a+b
    In [742]: vfoo=np.frompyfunc(foo,2,1)
    In [743]: timeit vfoo.reduceat(A2.data,A2.indptr[:-1],dtype=object).astype(float)
    10000 loops, best of 3: 26.2 µs per loop

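That is respectable speed. But I can't think of a way of writing a binary function (one that takes 2 arguments) that would implement argsort via reduction. So this is probably a dead end for this problem.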
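For what it's worth, a reduction does handle the single largest stored value per row, since np.maximum is a binary ufunc. The sketch below uses a made-up matrix and assumes no empty rows. It returns values only, not column indices, and it does not extend to the top ten, which is where this route stops.

```python
import numpy as np
from scipy import sparse

# hypothetical small matrix; every row has at least one stored value
A2 = sparse.csr_matrix(np.array([[1., 0., 2., 0.],
                                 [0., 3., 0., 0.],
                                 [4., 0., 5., 6.]]))

# largest stored value in each row's block of .data
row_max = np.maximum.reduceat(A2.data, A2.indptr[:-1])
```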
