# Assistance in avoiding allocations in summations

I need to do several summations of specific matrix column(s). I can’t figure out a way to get rid of temporary allocations.

The code below is my best version, but it leaves a lot to be desired. If I don’t use `@view`, both the time and the allocations get worse.

`specialArray` is the array I am after. Each element in it is the result of several summations.
Please don’t exploit any patterns in the array sizes, since in the real code everything varies. I mention this because `indexArray` here always indexes sets of 3, but that is not always the case.

```julia
using LinearAlgebra: dot
using BenchmarkTools

################ Setup the problem #################
nCharge = 3000
nMol = 1000

array1 = ones(Float64, nMol, nMol) * 3
array2 = ones(Float64, nCharge, nCharge) * 0.5
array3 = ones(Float64, nCharge) * 0.25
specialArray = zeros(nMol)

indexArray = [0 for i = 1:nMol, j = 1:2]
iIndex = 1
jIndex = 3
for i = 1:nMol
    global iIndex, jIndex
    indexArray[i, 1] = iIndex
    indexArray[i, 2] = jIndex

    iIndex += 3
    jIndex += 3
end

################ Define the problem #################
function SumThingsUp!(specialArray::Vector, array1::SubArray, array2::SubArray,
                      array3::SubArray, indexArray::SubArray)
    for i = 1:length(specialArray)
        st, ed = indexArray[i, 1], indexArray[i, 2]

        specialArray[i] = sum(sum, @view(array1[i, :])) +
                          sum(sum, @view(array2[:, st:ed])) +
                          dot(@view(array3[st:ed]), @view(array3[st:ed]))
    end
    return specialArray
end

################ Run the problem #################
@btime SumThingsUp!($specialArray, $(@view array1[:, :]), $(@view array2[:, :]),
                    $(@view array3[:]), $(@view indexArray[:, :]))
```

Each element in `specialArray` for this MWE should be `7500.1875`. The fact that all the entries come out the same is not a reflection of the real problem, just the nature of the MWE.

Currently `@btime` produces `15.488 ms (4000 allocations: 203.13 KiB)`

Should I just write a function with plain `for` loops that does the summations manually?

I have tried explicitly writing the summations in a function with `for` loops, and while it cuts down the allocations a little, it makes the time worse by roughly the same proportion it saves in allocations…

```julia
function Sumat(array::SubArray)
    sumat = 0.0
    for i in eachindex(array)
        sumat += array[i]
    end
    return sumat
end

function Dotat(array)
    dotat = 0.0
    for i in eachindex(array)
        dotat += array[i] * array[i]
    end
    return dotat
end

function SumThingsUp!(specialArray::Vector{Float64}, array1::SubArray{Float64}, array2::SubArray{Float64},
                      array3::SubArray{Float64}, indexArray::SubArray{Int64})
    for i = 1:length(specialArray)
        st, ed = indexArray[i, 1], indexArray[i, 2]

        specialArray[i] = Sumat(@view(array1[i, :])) +       # was: sum(sum, @view(array1[i, :]))
                          Sumat(@view(array2[:, st:ed])) +   # was: sum(sum, @view(array2[:, st:ed]))
                          Dotat(@view(array3[st:ed]))        # was: dot(@view(array3[st:ed]), @view(array3[st:ed]))
    end
    return specialArray
end
```

`@btime` now produces `19.108 ms (3000 allocations: 156.25 KiB)`.

## Edit:

Adding `@inbounds` to the `for` loops brings allocations down to 0, but the time is still high: `18.263 ms (0 allocations: 0 bytes)`. This is still a lot slower than the first method, which has the 4000 allocations.

When I run `@profiler` on the version that has `sum` in it, almost all of the profile is consumed by several different `simd` blocks.

You can add `@simd` to the summation loops.
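For example, a sketch of the `Sumat` helper from above with both decorators added (this is the standard pattern; the exact speedup will depend on the hardware):

```julia
# Sumat with @inbounds (skip bounds checks) and @simd (allow vectorized reassociation)
function Sumat(array)
    sumat = 0.0
    @inbounds @simd for i in eachindex(array)
        sumat += array[i]
    end
    return sumat
end
```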


Thanks, that gets it down to the same speed as the `sum` and `dot` method!

`15.053 ms (0 allocations: 0 bytes)`

Is there a way to do this without writing my own devectorized summation functions?

I think your `Sumat` is almost precisely `sum`, only 10x slower (because it lacks `@inbounds`):

```julia
julia> aa = view(rand(100,100), :, 1:99);

julia> @btime sum($aa)
  935.621 ns (0 allocations: 0 bytes)
4951.355345450943

julia> @btime Sumat($aa)
  11.050 μs (0 allocations: 0 bytes)
4951.355345450939
```

There is no need for `sum(sum, …)` that I can see, as you don’t have arrays of arrays. So I think you could just write this:

```julia
@inbounds @views for i = 1:length(specialArray)
    st, ed = indexArray[i, :]
    specialArray[i] = sum(array1[i, :]) + sum(array2[:, st:ed]) + dot(array3[st:ed], array3[st:ed])
end
```

or this

```julia
sum!(specialArray, array1)
@inbounds @views for i = 1:length(specialArray)
    st, ed = indexArray[i, :]
    specialArray[i] += sum(array2[:, st:ed]) + dot(array3[st:ed], array3[st:ed])
end
```

I presume the real `indexArray` is more complicated, too? Otherwise it might be better to work out `st, ed` directly.

Note aside that you can just write this:

```julia
indexArray = [ (1:3:3nMol-2) (3:3:3nMol) ]
```

Yes `indexArray` cannot be assumed to be in increments of 3 in the real code.

When I use plain `sum()`, the result is slightly slower than doing the summation manually as in `Sumat`, and it also generates `5000 allocations` (`1000` of which come from `st, ed = indexArray[i,:]`).

The devectorized `for` loops for the summations are sped up by both `@inbounds` and `@simd`. With those decorators the time is almost always under `15 ms` for my example, often around `14.5 ms`. With my original code or your modification, the results are almost always near `16 ms` and always have `4000 allocations` or more. The speed difference is not large; what bothered me was the allocations.

I did not realize I could use `@view` as a for loop decorator… that really cleaned the code up!

There must be a way to use the built-in `sum` function without the allocations; they seem to add about a 7% increase in run time.

OK, maybe something makes `Sumat` work better in your example than in my toy timing. But I don’t see what you mean by devectorized; it’s just a different implementation of the same thing. (Almost: I think `sum` sometimes adds pairwise, for accuracy reasons.)

If you want to remove all allocations, then don’t make views and pass them around. Just add an inner loop over `j` with `+= array1[i,j]`, then a pair of loops over `k = st:ed` and `j = 1:nCharge` with `+= array2[j,k]`, and a loop over `k = st:ed` with `+= array3[k]^2`. You will probably want to keep adding to an accumulator variable that is written into `specialArray[i]` at the end.
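A sketch of what that fully devectorized version might look like (the name `SumThingsUpLoops!` is mine, and the argument layout follows the MWE above; this is not the poster’s final code):

```julia
function SumThingsUpLoops!(specialArray, array1, array2, array3, indexArray)
    @inbounds for i in eachindex(specialArray)
        st, ed = indexArray[i, 1], indexArray[i, 2]
        acc = 0.0
        @simd for j in axes(array1, 2)       # replaces sum(array1[i, :])
            acc += array1[i, j]
        end
        for k = st:ed
            @simd for j in axes(array2, 1)   # replaces sum(array2[:, st:ed])
                acc += array2[j, k]
            end
            acc += array3[k]^2               # replaces dot(array3[st:ed], array3[st:ed])
        end
        specialArray[i] = acc                # write the accumulator once at the end
    end
    return specialArray
end
```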

Just make sure the 1 ms improvement is going to add up to more than the time you spend writing all this out… and decoding it again later when you wonder what it’s all doing!


`1 ms` will save about 2–5 hours per simulation; this routine gets called millions of times. I was hoping there was a way to avoid explicit `for` loops since I consider them ugly to look at :S That said, being an ex-Fortran guy, I consider them easy to follow and understand…

OK, then you should absolutely just write the loops. Also look into `@threads` on the outermost loop if possible; threaded code tends to really benefit from having zero allocations inside.
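A hedged sketch of threading the outer loop (the name `SumThingsUpThreaded!` is mine; it assumes Julia was started with multiple threads, e.g. `julia -t 4`, and is race-free because each iteration writes only its own `specialArray[i]`):

```julia
using Base.Threads

function SumThingsUpThreaded!(specialArray, array1, array2, array3, indexArray)
    @threads for i in eachindex(specialArray)    # each i is independent
        st, ed = indexArray[i, 1], indexArray[i, 2]
        acc = 0.0
        @inbounds begin
            @simd for j in axes(array1, 2)
                acc += array1[i, j]
            end
            for k = st:ed
                @simd for j in axes(array2, 1)
                    acc += array2[j, k]
                end
                acc += array3[k]^2
            end
        end
        specialArray[i] = acc
    end
    return specialArray
end
```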
