Cumsum behaviour on sparse arrays

Hi there,

I’m rather new to Julia (and already enjoying it a lot), so please don’t be too hard on me if my topic doesn’t make that much sense. However I think I noticed a somewhat surprising behaviour when cumsum (cumulated sum) is applied to a SparseVector which also leads to a large performance issue in certain cases:

using SparseArrays
prob_vec = sprand(10, 0.2)

10-element SparseVector{Float64,Int64} with 2 stored entries:
[2 ] = 0.459558
[10] = 0.50192

res = cumsum(prob_vec)

10-element SparseVector{Float64,Int64} with 9 stored entries:
[2 ] = 0.459558
[3 ] = 0.459558
[4 ] = 0.459558

[7 ] = 0.459558
[8 ] = 0.459558
[9 ] = 0.459558
[10] = 0.961478

So once cumsum found the first non-zero entry all subsequent indices are not sparse anymore. I understand that this has to happen mathematically, in case you want to check in-between indices like 7. But it might be misleading for some users to call res a SparseVector, letting them believe res is still sparse and fast in subsequent operations.

In my specific case prob_vec is a probability vector (sum(prob_vec)=1) and I was interested in

sel_ind = findfirst(cumsum(prob_vec).>rand())

which was extremely slow compared to (factor 1000 difference on a large prob_vec)

prob_vec_inds, prob_vec_vals = findnz(prob_vec)
prob_cum = cumsum(prob_vec_vals)
sel_ind = prob_vec_inds[findfirst(prob_cum.>rand())]

due to the behaviour of cumsum on large SparseVectors. Maybe there is a way to notify users about this behaviour?

Best

Welcome @mlanghinrichs to the Julia community and the discourse forum.

I agree that returning a sparse vector is not very helpful in a situation in which you know a priori that only the first few elements will be structurally zero. I wonder if one should change the behavior of cumsum for sparse vectors/arrays. On the other hand, if you know that your input is going to be sparse, maybe it’s really better to address your very own needs in your very own code?

Also, your application sounds like you’re computing something like quantiles? Have you checked out the Statistics package(s) for high-level functions that do what you need?

1 Like

Thanks for your answer! I agree, in the end it’s probably code-specific and not really a general issue (but now maybe people will find it helpful who have a similar problem).

Yes, the actual application in the “non-sparse” case is quite common in quantitative biology; the lines above are part of a Gillespie algorithm / stochastic simulation algorithm.

With your hint I actually could speed-up the code a lot more; I can directly sample from prob_vec using the StatsBase package:

using StatsBase
prob_vec_inds, prob_vec_vals = findnz(prob_vec)
sel_ind = sample(prob_vec_inds, Weights(prob_vec_vals))

(adapted from How can I write an arbitrary discrete distribution in Julia? - Stack Overflow)

Best

2 Likes

Glad that it worked out well for you.