Cumsum behaviour on sparse arrays

mlanghinrichs · January 28, 2020, 10:13am

Hi there,

I’m rather new to Julia (and already enjoying it a lot), so please don’t be too hard on me if my topic doesn’t make that much sense. However I think I noticed a somewhat surprising behaviour when cumsum (cumulated sum) is applied to a SparseVector which also leads to a large performance issue in certain cases:

using SparseArrays
prob_vec = sprand(10, 0.2)

10-element SparseVector{Float64,Int64} with 2 stored entries:
[2 ] = 0.459558
[10] = 0.50192

res = cumsum(prob_vec)

10-element SparseVector{Float64,Int64} with 9 stored entries:
[2 ] = 0.459558
[3 ] = 0.459558
[4 ] = 0.459558
⋮
[7 ] = 0.459558
[8 ] = 0.459558
[9 ] = 0.459558
[10] = 0.961478

So once cumsum found the first non-zero entry all subsequent indices are not sparse anymore. I understand that this has to happen mathematically, in case you want to check in-between indices like 7. But it might be misleading for some users to call res a SparseVector, letting them believe res is still sparse and fast in subsequent operations.

In my specific case prob_vec is a probability vector (sum(prob_vec)=1) and I was interested in

sel_ind = findfirst(cumsum(prob_vec).>rand())

which was extremely slow compared to (factor 1000 difference on a large prob_vec)

prob_vec_inds, prob_vec_vals = findnz(prob_vec)
prob_cum = cumsum(prob_vec_vals)
sel_ind = prob_vec_inds[findfirst(prob_cum.>rand())]

due to the behaviour of cumsum on large SparseVectors. Maybe there is a way to notify users about this behaviour?

Best

dkarrasch · January 28, 2020, 3:45pm

Welcome @mlanghinrichs to the Julia community and the discourse forum.

I agree that returning a sparse vector is not very helpful in a situation in which you know a priori that only the first few elements will be structurally zero. I wonder if one should change the behavior of cumsum for sparse vectors/arrays. On the other hand, if you know that your input is going to be sparse, maybe it’s really better to address your very own needs in your very own code?

Also, your application sounds like you’re computing something like quantiles? Have you checked out the Statistics package(s) for high-level functions that do what you need?

mlanghinrichs · January 28, 2020, 7:30pm

Thanks for your answer! I agree, in the end it’s probably code-specific and not really a general issue (but now maybe people will find it helpful who have a similar problem).

Yes, the actual application in the “non-sparse” case is quite common in quantitative biology; the lines above are part of a Gillespie algorithm / stochastic simulation algorithm.

With your hint I actually could speed-up the code a lot more; I can directly sample from prob_vec using the StatsBase package:

using StatsBase
prob_vec_inds, prob_vec_vals = findnz(prob_vec)
sel_ind = sample(prob_vec_inds, Weights(prob_vec_vals))

(adapted from How can I write an arbitrary discrete distribution in Julia? - Stack Overflow)

Best

dkarrasch · January 28, 2020, 8:15pm

Glad that it worked out well for you.

Topic		Replies	Views
Splitting and Summing Sparse Arrays General Usage question , sparse	5	812	August 18, 2022
Weighted sum of sparse matrices New to Julia	3	1592	August 27, 2017
Error in mean value of sparse arrays General Usage	5	399	March 31, 2022
How to add a dense and a sparse vector and get a dense vector as result New to Julia	7	1224	June 13, 2021
Why does Julia promote full array to sparse array? Internals & Design	8	939	August 10, 2020

Cumsum behaviour on sparse arrays

Related topics