I was profiling some code that was surprisingly slow, and it turned out that the source of the issue was a sum over a view into a large sparse matrix. Here is an MWE that recreates it:
d = sprand(Bool,10000,10000, 0.01)
e = view(d, rand(1:10000,5000), rand(1:10000,9000))
using BenchmarkTools
@benchmark sum($d, 1)
BenchmarkTools.Trial:
memory estimate: 78.20 KiB
allocs estimate: 2
--------------
minimum time: 421.354 μs (0.00% GC)
median time: 450.828 μs (0.00% GC)
mean time: 463.575 μs (0.00% GC)
maximum time: 1.226 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark sum($e, 1)
BenchmarkTools.Trial:
memory estimate: 585.06 KiB
allocs estimate: 51
--------------
minimum time: 3.286 s (0.00% GC)
median time: 3.312 s (0.00% GC)
mean time: 3.312 s (0.00% GC)
maximum time: 3.339 s (0.00% GC)
--------------
samples: 2
evals/sample: 1
That’s a slowdown by a factor of 7000.
Is this (1) expected behaviour, so I'm best off not using views into sparse matrices; (2) me doing something wrong; or (3) a bug?
I would think this is expected: the sum of a sparse CSC matrix is trivial, you just sum the internal vector storing the values of the non-zeros. Your sum over the view, on the other hand, uses the generic method sum(A::AbstractArray, region) at reducedim.jl:320. That looks up each index individually, which for a CSC sparse matrix means two pointer indirections per element, most of which return zero. A 7000x slowdown therefore seems about right. I suspect that a sum specialized for random access into a sparse matrix would be hard to write and not much more efficient (although I might be wrong).
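To make the two code paths concrete, here is a small illustration in modern (post-0.7) syntax, using `nonzeros`, the public accessor for the stored values (the thread above uses 0.6-era calls like `sum(d, 1)`):

```julia
using SparseArrays

# Summing a SparseMatrixCSC touches only the stored values:
A = sprand(Bool, 1000, 1000, 0.01)
@assert sum(A) == sum(nonzeros(A))   # no per-index lookups needed

# A view instead goes through generic getindex: each (row, col) lookup is a
# binary search within that column's stored row indices, so summing a view
# pays that cost for every element of the view, zeros included.
v = view(A, 1:500, 1:500)
@assert sum(v) == sum(A[1:500, 1:500])   # same answer, much slower path
```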
Depending on what your real use case is, if you’re trying to sum elements only within some contiguous subrange of rows and columns, it wouldn’t be hard to write a specialized routine for SparseMatrixCSC storage. Views of sparse matrices are not well optimized since they fall back to generic operations which aren’t aware of the sparsity.
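For instance, a minimal sketch of such a routine (the name `rangesum` is made up here): it walks only the stored entries in CSC order and keeps those whose row index falls in the requested range, accumulating into an `Int` since the thread's matrices are `Bool`.

```julia
using SparseArrays

# Hypothetical helper: sum the entries of A inside rows × cols by walking
# the stored entries column by column, never touching structural zeros.
function rangesum(A::SparseMatrixCSC, rows::UnitRange, cols::UnitRange)
    rv, nz = rowvals(A), nonzeros(A)
    s = 0
    for col in cols
        for k in nzrange(A, col)           # stored entries of this column
            rv[k] in rows && (s += nz[k])  # O(1) membership test on a range
        end
    end
    return s
end

A = sprand(Bool, 1000, 1000, 0.05)
@assert rangesum(A, 100:400, 200:900) == sum(A[100:400, 200:900])
```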
Slicing would make a copy, which you don’t necessarily need. Depends how sensitive you are to the performance here vs how much special-purpose code you want to write for a particular operation.
Yes, I realize it’ll make a copy, but in the end I think that will still be a lot faster. Even though the matrices are sparse, 1e6 stored elements is realistic for my use. I am sensitive to the performance, and the problem is that the elements are never contiguous in my actual use case. I think writing the specialized code is beyond me.
Use @edit sum(d) to look at the code, but essentially it boils down to sum(d.nzval). Anyway, have a look at the docs of nzrange for an example of iterating over the non-zeros of a sparse matrix.
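For reference, the iteration pattern from the `nzrange` docs looks like this, wrapped here in a hypothetical helper so the accumulator stays in function scope: `rowvals`/`nonzeros` give the row indices and values of the stored entries, and `nzrange(A, j)` gives the range of those belonging to column j.

```julia
using SparseArrays

# Hypothetical helper: total of all stored entries, visited in CSC order.
function stored_total(A::SparseMatrixCSC)
    rows = rowvals(A)
    vals = nonzeros(A)
    total = zero(eltype(A))
    for j in 1:size(A, 2)
        for k in nzrange(A, j)
            i = rows[k]       # row index of this stored entry
            total += vals[k]  # its value, A[i, j]
        end
    end
    return total
end

A = sprand(5, 5, 0.5)
@assert stored_total(A) ≈ sum(A)
```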
If you use a matrix product instead of a view to compute the sum, it’ll be fast.
d = sprand(Bool,10000,10000, 0.01)
i = rand(1:10000,5000)
j = rand(1:10000,9000)
e = view(d, i, j)
s = sum(e,1)
t = (sparse(ones(Int,length(i)), i, ones(Int,length(i)),1,10000)*d)[j]
@assert s[:] == t
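In post-0.7 syntax the same trick might look like this (smaller sizes for brevity; `sum(e, 1)` becomes `sum(e, dims=1)`, and `sparse`/`sprand` now live in `SparseArrays`):

```julia
using SparseArrays, Random

Random.seed!(0)
d = sprand(Bool, 1000, 1000, 0.01)
i = rand(1:1000, 500)
j = rand(1:1000, 900)

e = view(d, i, j)
s = sum(e, dims=1)   # slow: generic reduction over the view

# Indicator row vector with a 1 for each entry of i (duplicates add up, so
# repeated rows are counted just as the view counts them); the product
# computes all column sums at once, and [j] then picks out columns j.
t = (sparse(ones(Int, length(i)), i, ones(Int, length(i)), 1, 1000) * d)[j]
@assert vec(s) == t
```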
It still might be worth playing with. In particular, it might be worth “finishing” it (I haven’t had the time lately) and registering it as a package, even if it’s not yet as fast as it could someday be. If anyone wants to join the fun, I’d be happy to coach folks through anything that’s not clear.