# Cumsum behaviour on sparse arrays

Hi there,

I’m rather new to Julia (and already enjoying it a lot), so please don’t be too hard on me if my topic doesn’t make much sense. However, I think I noticed a somewhat surprising behaviour when `cumsum` (cumulative sum) is applied to a `SparseVector`, which also leads to a large performance issue in certain cases:

```julia
using SparseArrays
prob_vec = sprand(10, 0.2)
```

```
10-element SparseVector{Float64,Int64} with 2 stored entries:
  [2 ]  =  0.459558
  [10]  =  0.50192
```

```julia
res = cumsum(prob_vec)
```

```
10-element SparseVector{Float64,Int64} with 9 stored entries:
  [2 ]  =  0.459558
  [3 ]  =  0.459558
  [4 ]  =  0.459558
  ⋮
  [7 ]  =  0.459558
  [8 ]  =  0.459558
  [9 ]  =  0.459558
  [10]  =  0.961478
```

So once `cumsum` reaches the first non-zero entry, every subsequent index is stored explicitly and the result is no longer sparse. I understand that this has to happen mathematically, in case you want to look up in-between indices like 7. But it might be misleading for some users that `res` is still called a `SparseVector`, letting them believe it is still sparse and fast in subsequent operations.
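To make the storage issue concrete, here is a small deterministic sketch (the size, indices, and values are illustrative, not taken from the example above) comparing the number of stored entries before and after `cumsum`:

```julia
using SparseArrays

# Illustrative sparse vector: 2 stored entries out of 10_000
n = 10_000
v = sparsevec([2, 9_000], [0.3, 0.7], n)
c = cumsum(v)

# Every entry from the first non-zero index onward is non-zero in the
# cumulative sum, so the "sparse" result is essentially dense in storage:
println(nnz(v))   # 2
println(nnz(c))   # at least 9_999 of the 10_000 entries are stored
```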

In my specific case `prob_vec` is a probability vector (`sum(prob_vec) == 1`) and I was interested in

`sel_ind = findfirst(cumsum(prob_vec).>rand())`

which was extremely slow (roughly a factor of 1000 on a large `prob_vec`) compared to

```julia
prob_vec_inds, prob_vec_vals = findnz(prob_vec)
prob_cum = cumsum(prob_vec_vals)
sel_ind = prob_vec_inds[findfirst(prob_cum .> rand())]
```

due to the behaviour of `cumsum` on large `SparseVector`s. Maybe there is a way to notify users about this behaviour?
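For what it’s worth, the two versions really do select the same index: the cumulative sum is flat between stored entries, so the first crossing of the random threshold always happens at a stored index. A small self-contained sketch (the vector here is mine, chosen with exact binary fractions so the sums are exact):

```julia
using SparseArrays

# Illustrative probability vector with stored entries at 3, 42, and 97
prob_vec = sparsevec([3, 42, 97], [0.25, 0.5, 0.25], 100)

r = rand()

# Naive version: the cumulative sum is structurally dense.
slow_ind = findfirst(cumsum(prob_vec) .> r)

# Sparse-aware version: cumsum only over the stored values.
inds, vals = findnz(prob_vec)
fast_ind = inds[findfirst(cumsum(vals) .> r)]

slow_ind == fast_ind   # both always land on a stored index (3, 42, or 97)
```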

Best

Welcome @mlanghinrichs to the Julia community and the Discourse forum.

I agree that returning a sparse vector is not very helpful in a situation in which you know a priori that only the first few elements will be structurally zero. I wonder if one should change the behavior of `cumsum` for sparse vectors/arrays. On the other hand, if you know that your input is going to be sparse, maybe it’s really better to handle this case in your own code?

Also, your application sounds like you’re computing something like quantiles? Have you checked out the `Statistics` package(s) for high-level functions that do what you need?


Thanks for your answer! I agree, in the end it’s probably code-specific and not really a general issue (but now maybe people will find it helpful who have a similar problem).

Yes, the actual application in the “non-sparse” case is quite common in quantitative biology; the lines above are part of a Gillespie algorithm / stochastic simulation algorithm.

With your hint I could actually speed up the code a lot more; I can directly sample from `prob_vec` using the `StatsBase` package:

```julia
using SparseArrays, StatsBase
prob_vec_inds, prob_vec_vals = findnz(prob_vec)
sel_ind = sample(prob_vec_inds, Weights(prob_vec_vals))
```

Best


Glad that it worked out well for you.