# Cumsum behaviour on sparse arrays

Hi there,

I’m rather new to Julia (and already enjoying it a lot), so please don’t be too hard on me if my topic doesn’t make much sense. However, I think I noticed a somewhat surprising behaviour when `cumsum` (cumulative sum) is applied to a `SparseVector`, which also leads to a large performance issue in certain cases:

```julia
using SparseArrays
prob_vec = sprand(10, 0.2)
```

```
10-element SparseVector{Float64,Int64} with 2 stored entries:
  [2 ]  =  0.459558
  [10]  =  0.50192
```

```julia
res = cumsum(prob_vec)
```

```
10-element SparseVector{Float64,Int64} with 9 stored entries:
  [2 ]  =  0.459558
  [3 ]  =  0.459558
  [4 ]  =  0.459558
  ⋮
  [7 ]  =  0.459558
  [8 ]  =  0.459558
  [9 ]  =  0.459558
  [10]  =  0.961478
```

So once `cumsum` reaches the first non-zero entry, every subsequent index is stored explicitly and the result is no longer sparse. I understand that this has to happen mathematically, in case you want to look up in-between indices like 7. But it might be misleading for some users that `res` is still called a `SparseVector`, letting them believe it is still sparse and fast in subsequent operations.
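To make the storage issue concrete, here is a small deterministic sketch (the size, indices, and values are illustrative, not taken from the example above) comparing the number of stored entries before and after `cumsum`:

```julia
using SparseArrays

# Illustrative sparse vector: 2 stored entries out of 10_000
n = 10_000
v = sparsevec([2, 9_000], [0.3, 0.7], n)
c = cumsum(v)

# Every entry from the first non-zero index onward is non-zero in the
# cumulative sum, so the "sparse" result is essentially dense in storage:
println(nnz(v))   # 2
println(nnz(c))   # at least 9_999 of the 10_000 entries are stored
```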

In my specific case `prob_vec` is a probability vector (`sum(prob_vec) == 1`) and I was interested in

`sel_ind = findfirst(cumsum(prob_vec).>rand())`

which was extremely slow (roughly a factor of 1000 on a large `prob_vec`) compared to

```julia
prob_vec_inds, prob_vec_vals = findnz(prob_vec)
prob_cum = cumsum(prob_vec_vals)
sel_ind = prob_vec_inds[findfirst(prob_cum .> rand())]
```

due to the behaviour of `cumsum` on large `SparseVector`s. Maybe there is a way to notify users about this behaviour?
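For what it’s worth, the two versions really do select the same index: the cumulative sum is flat between stored entries, so the first crossing of the random threshold always happens at a stored index. A small self-contained sketch (the vector here is mine, chosen with exact binary fractions so the sums are exact):

```julia
using SparseArrays

# Illustrative probability vector with stored entries at 3, 42, and 97
prob_vec = sparsevec([3, 42, 97], [0.25, 0.5, 0.25], 100)

r = rand()

# Naive version: the cumulative sum is structurally dense.
slow_ind = findfirst(cumsum(prob_vec) .> r)

# Sparse-aware version: cumsum only over the stored values.
inds, vals = findnz(prob_vec)
fast_ind = inds[findfirst(cumsum(vals) .> r)]

slow_ind == fast_ind   # both always land on a stored index (3, 42, or 97)
```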

Best

Welcome @mlanghinrichs to the Julia community and the Discourse forum.

I agree that returning a sparse vector is not very helpful in a situation in which you know a priori that only the first few elements will be structurally zero. I wonder if one should change the behavior of `cumsum` for sparse vectors/arrays. On the other hand, if you know that your input is going to be sparse, maybe it’s really better to handle this case in your own code?

Also, your application sounds like you’re computing something like quantiles? Have you checked out the `Statistics` package(s) for high-level functions that do what you need?


Thanks for your answer! I agree, in the end it’s probably code-specific and not really a general issue (but now maybe people will find it helpful who have a similar problem).

Yes, the actual application in the “non-sparse” case is quite common in quantitative biology; the lines above are part of a Gillespie algorithm / stochastic simulation algorithm.

With your hint I could actually speed up the code a lot more; I can directly sample from `prob_vec` using the `StatsBase` package:

```julia
using SparseArrays, StatsBase
prob_vec_inds, prob_vec_vals = findnz(prob_vec)
sel_ind = sample(prob_vec_inds, Weights(prob_vec_vals))
```

Best


Glad that it worked out well for you.