Rollsum/cumulative optimisation

gitboy16 · June 1, 2023, 10:09am

Hi,
I was wondering if someone can help me speed up the below function (roll_sum) in anyway? (Multi-threading, better algorithm,…etc).
An easy way would be to remove the copy but I have added it so it is easy to benchmark. It is really the double for loop part that I am trying to optimize.
Thank you for your help.

import Random
Random.seed!(1234)
A = rand(0:100, 1_000_000, 8)

cvt(x::UInt64) = string(x, base = 10)

function roll_sum(
    octs::Matrix{UInt64},
    len::StepRange{Int64, Int64},
    k::Int64,
)
    cts = copy(octs)
    rollSum = UInt64(0)
    @inbounds for j in len
        for i in 1:k
            tmp = cts[j, i]
            cts[j, i] = rollSum
            rollSum += tmp
        end
    end
    return cts
end

B = A .% UInt64
sz = size(B)

using BenchmarkTools
@btime roll_sum(B, 1:1:sz[1], sz[2]);

# to see the result in human readable format
roll_sum(B, 1:1:sz[1], sz[2]) .|> cvt

screw_dog · June 1, 2023, 2:27pm

Here’s a couple of alternatives:

function roll_sum2(octs, len, k)
    cts = Matrix{UInt64}(undef, length(len), k)
    rollsum = zero(UInt64)
    @inbounds for j in len, i in 1:k
        cts[j, i] = rollsum
        rollsum += octs[j, i]
    end
    return cts
end

function roll_sum3(octs, len, k)
    cts = permutedims(octs)
    rollsum = zero(UInt64)
    @inbounds for i in 1:k, j in len
        tmp = cts[j, i]
        cts[j, i] = rollsum
        rollsum += tmp
    end
    return cts
end

The first avoids copying the data (ie using the undef initialiser), the second copies the data into column-major order which is more efficient for Julia (see the docs). I get these results:

  50.147 ms (4 allocations: 61.04 MiB)
  35.773 ms (4 allocations: 61.04 MiB)
  41.850 ms (4 allocations: 61.04 MiB)

which suggests with this data a significant amount of time is spent copying. If you could arrange your data in column-major order (ie the rolling sum is computed down columns and then across rows) you’d get further speedup combining both these approaches.

gitboy16 · June 1, 2023, 2:32pm

Thank you but I have added the copy just to be able to benchmark it. But in reality I don’t need to copy it and can do everything inplace. I am more looking for improving the double for loop part by using a better algorithm, multi-threading or any other trick…

rocco_sprmnt21 · June 1, 2023, 4:03pm

If you can’t find anything better, this one fits the bill

reshape(cumsum(vec(B')),8,:)'

stevengj · June 1, 2023, 4:04pm

Note that this algorithm discards the last element of the input array from the sum, is that what you want? (This is because the first element of the return value is 0 rather than the first element of the input, the latter being the way cumsum works.)

Have you considered just (a) storing octs in the opposite/transposed order so that you can skip the call to permutedims, and (b) then just calling something like:

roll_sum(A) = reshape(cumsum(vec(A)), size(A))

gitboy16 · June 1, 2023, 4:24pm

Yes it discard the last element and the first is 0. The answer of @rocco_sprmnt21 is close but does not return the correct result with regards to first and last element.

gitboy16 · June 1, 2023, 4:25pm

Thank you @stevengj , can you please provide an example? I am not sure to get what you mean. Thanks.

rocco_sprmnt21 · June 1, 2023, 4:45pm

using ShiftedArrays

reshape(cumsum(ShiftedArray(vec($B'),1,default=0x0)),8,:)'

but if you can do what @stevengj recommends you save part of the transformations

gitboy16 · June 2, 2023, 8:59am

Thank you @rocco_sprmnt21 for your help. I think I will keep searching for something faster.

rocco_sprmnt21 · June 2, 2023, 11:47am

check this expression

reshape(accumulate(+,ShiftedArray(vec(B),1,default=zero(UInt64)),init=0),8,:)'

gitboy16 · June 2, 2023, 1:19pm

Your previous answer is faster on my computer.

rocco_sprmnt21 · June 2, 2023, 2:11pm

I tried the expression inside a function in the following way.
Here the time I get in a clean REPL session

julia> using BenchmarkTools

julia> import Random

julia> Random.seed!(1234)
Random.TaskLocalRNG()

julia> A = rand(0:100, 1_000_000, 8);

julia> B = A .% UInt64
1000000×8 Matrix{UInt64}:




julia> using ShiftedArrays

julia> function roll_sum4(B)
           vb=vec(B)
           Out=similar(vb)
           accumulate!(+,Out,ShiftedArray(vb,1,default=zero(UInt64)),init=0)
           reshape(Out,8,:)'
       end
roll_sum4 (generic function with 1 method)

julia> @btime   reshape(accumulate(+,ShiftedArray(vec($B),1,default=zero(UInt64)),init=0),8,:)'
  20.844 ms (9 allocations: 61.04 MiB)


julia> @btime roll_sum4($B)
  18.623 ms (9 allocations: 61.04 MiB)


julia> @btime reshape(cumsum(ShiftedArray(vec($B'),1,default=0x0)),8,:)'
  25.406 ms (4 allocations: 61.04 MiB)
1000000×8 adjoint(::Matrix{UInt64}) with eltype UInt64:

gitboy16 · June 5, 2023, 11:36am

I will mark this as the solution even if I am not excited about it. Thank you.

Topic		Replies	Views
Rolling Sum Statistics question	28	6581	November 7, 2019
A weird operation to optimize General Usage	13	106	May 3, 2025
Reduce memory allocated in array view and in place sum Performance question	12	693	November 10, 2023
Unclear allocation behaviour with built-in sum() Performance question , memory-allocation	4	204	February 2, 2025
Faster sum loop when looping through vector multiple times than once Performance	5	291	September 29, 2023

Rollsum/cumulative optimisation

Related topics