Rollsum/cumulative optimisation

I was wondering if someone can help me speed up the below function (roll_sum) in anyway? (Multi-threading, better algorithm,…etc).
An easy way would be to remove the copy but I have added it so it is easy to benchmark. It is really the double for loop part that I am trying to optimize.
Thank you for your help.

import Random
A = rand(0:100, 1_000_000, 8)

cvt(x::UInt64) = string(x, base = 10)

function roll_sum(
    len::StepRange{Int64, Int64},
    cts = copy(octs)
    rollSum = UInt64(0)
    @inbounds for j in len
        for i in 1:k
            tmp = cts[j, i]
            cts[j, i] = rollSum
            rollSum += tmp
    return cts

B = A .% UInt64
sz = size(B)

using BenchmarkTools
@btime roll_sum(B, 1:1:sz[1], sz[2]);

# to see the result in human readable format
roll_sum(B, 1:1:sz[1], sz[2]) .|> cvt

Here’s a couple of alternatives:

function roll_sum2(octs, len, k)
    cts = Matrix{UInt64}(undef, length(len), k)
    rollsum = zero(UInt64)
    @inbounds for j in len, i in 1:k
        cts[j, i] = rollsum
        rollsum += octs[j, i]
    return cts

function roll_sum3(octs, len, k)
    cts = permutedims(octs)
    rollsum = zero(UInt64)
    @inbounds for i in 1:k, j in len
        tmp = cts[j, i]
        cts[j, i] = rollsum
        rollsum += tmp
    return cts

The first avoids copying the data (ie using the undef initialiser), the second copies the data into column-major order which is more efficient for Julia (see the docs). I get these results:

  50.147 ms (4 allocations: 61.04 MiB)
  35.773 ms (4 allocations: 61.04 MiB)
  41.850 ms (4 allocations: 61.04 MiB)

which suggests with this data a significant amount of time is spent copying. If you could arrange your data in column-major order (ie the rolling sum is computed down columns and then across rows) you’d get further speedup combining both these approaches.

Thank you but I have added the copy just to be able to benchmark it. But in reality I don’t need to copy it and can do everything inplace. I am more looking for improving the double for loop part by using a better algorithm, multi-threading or any other trick…

If you can’t find anything better, this one fits the bill


Note that this algorithm discards the last element of the input array from the sum, is that what you want? (This is because the first element of the return value is 0 rather than the first element of the input, the latter being the way cumsum works.)

Have you considered just (a) storing octs in the opposite/transposed order so that you can skip the call to permutedims, and (b) then just calling something like:

roll_sum(A) = reshape(cumsum(vec(A)), size(A))

Yes it discard the last element and the first is 0. The answer of @rocco_sprmnt21 is close but does not return the correct result with regards to first and last element.

Thank you @stevengj , can you please provide an example? I am not sure to get what you mean. Thanks.

using ShiftedArrays


but if you can do what @stevengj recommends you save part of the transformations

1 Like

Thank you @rocco_sprmnt21 for your help. I think I will keep searching for something faster.

check this expression


Your previous answer is faster on my computer.

I tried the expression inside a function in the following way.
Here the time I get in a clean REPL session

julia> using BenchmarkTools

julia> import Random

julia> Random.seed!(1234)

julia> A = rand(0:100, 1_000_000, 8);

julia> B = A .% UInt64
1000000×8 Matrix{UInt64}:

julia> using ShiftedArrays

julia> function roll_sum4(B)
roll_sum4 (generic function with 1 method)

julia> @btime   reshape(accumulate(+,ShiftedArray(vec($B),1,default=zero(UInt64)),init=0),8,:)'
  20.844 ms (9 allocations: 61.04 MiB)

julia> @btime roll_sum4($B)
  18.623 ms (9 allocations: 61.04 MiB)

julia> @btime reshape(cumsum(ShiftedArray(vec($B'),1,default=0x0)),8,:)'
  25.406 ms (4 allocations: 61.04 MiB)
1000000×8 adjoint(::Matrix{UInt64}) with eltype UInt64:

1 Like

I will mark this as the solution even if I am not excited about it. Thank you.