Hi,

I was wondering if someone can help me speed up the below function (roll_sum) in anyway? (Multi-threading, better algorithm,…etc).

An easy way would be to remove the `copy`

but I have added it so it is easy to benchmark. It is really the double for loop part that I am trying to optimize.

Thank you for your help.

```
import Random
Random.seed!(1234)
A = rand(0:100, 1_000_000, 8)
cvt(x::UInt64) = string(x, base = 10)
function roll_sum(
octs::Matrix{UInt64},
len::StepRange{Int64, Int64},
k::Int64,
)
cts = copy(octs)
rollSum = UInt64(0)
@inbounds for j in len
for i in 1:k
tmp = cts[j, i]
cts[j, i] = rollSum
rollSum += tmp
end
end
return cts
end
B = A .% UInt64
sz = size(B)
using BenchmarkTools
@btime roll_sum(B, 1:1:sz[1], sz[2]);
# to see the result in human readable format
roll_sum(B, 1:1:sz[1], sz[2]) .|> cvt
```

Here’s a couple of alternatives:

```
function roll_sum2(octs, len, k)
cts = Matrix{UInt64}(undef, length(len), k)
rollsum = zero(UInt64)
@inbounds for j in len, i in 1:k
cts[j, i] = rollsum
rollsum += octs[j, i]
end
return cts
end
function roll_sum3(octs, len, k)
cts = permutedims(octs)
rollsum = zero(UInt64)
@inbounds for i in 1:k, j in len
tmp = cts[j, i]
cts[j, i] = rollsum
rollsum += tmp
end
return cts
end
```

The first avoids copying the data (ie using the `undef`

initialiser), the second copies the data into column-major order which is more efficient for Julia (see the docs). I get these results:

```
50.147 ms (4 allocations: 61.04 MiB)
35.773 ms (4 allocations: 61.04 MiB)
41.850 ms (4 allocations: 61.04 MiB)
```

which suggests with this data a significant amount of time is spent copying. If you could arrange your data in column-major order (ie the rolling sum is computed down columns and then across rows) you’d get further speedup combining both these approaches.

Thank you but I have added the copy just to be able to benchmark it. But in reality I don’t need to copy it and can do everything inplace. I am more looking for improving the double for loop part by using a better algorithm, multi-threading or any other trick…

If you can’t find anything better, this one fits the bill

```
reshape(cumsum(vec(B')),8,:)'
```

Note that this algorithm discards the last element of the input array from the sum, is that what you want? (This is because the first element of the return value is 0 rather than the first element of the input, the latter being the way `cumsum`

works.)

Have you considered just (a) storing `octs`

in the opposite/transposed order so that you can skip the call to `permutedims`

, and (b) then just calling something like:

```
roll_sum(A) = reshape(cumsum(vec(A)), size(A))
```

Yes it discard the last element and the first is 0. The answer of @rocco_sprmnt21 is close but does not return the correct result with regards to first and last element.

Thank you @stevengj , can you please provide an example? I am not sure to get what you mean. Thanks.

using ShiftedArrays

```
reshape(cumsum(ShiftedArray(vec($B'),1,default=0x0)),8,:)'
```

but if you can do what @stevengj recommends you save part of the transformations

1 Like

Thank you @rocco_sprmnt21 for your help. I think I will keep searching for something faster.

check this expression

```
reshape(accumulate(+,ShiftedArray(vec(B),1,default=zero(UInt64)),init=0),8,:)'
```

Your previous answer is faster on my computer.

I tried the expression inside a function in the following way.

Here the time I get in a clean REPL session

```
julia> using BenchmarkTools
julia> import Random
julia> Random.seed!(1234)
Random.TaskLocalRNG()
julia> A = rand(0:100, 1_000_000, 8);
julia> B = A .% UInt64
1000000×8 Matrix{UInt64}:
julia> using ShiftedArrays
julia> function roll_sum4(B)
vb=vec(B)
Out=similar(vb)
accumulate!(+,Out,ShiftedArray(vb,1,default=zero(UInt64)),init=0)
reshape(Out,8,:)'
end
roll_sum4 (generic function with 1 method)
julia> @btime reshape(accumulate(+,ShiftedArray(vec($B),1,default=zero(UInt64)),init=0),8,:)'
20.844 ms (9 allocations: 61.04 MiB)
julia> @btime roll_sum4($B)
18.623 ms (9 allocations: 61.04 MiB)
julia> @btime reshape(cumsum(ShiftedArray(vec($B'),1,default=0x0)),8,:)'
25.406 ms (4 allocations: 61.04 MiB)
1000000×8 adjoint(::Matrix{UInt64}) with eltype UInt64:
```

1 Like

I will mark this as the solution even if I am not excited about it. Thank you.