Assistance in avoiding allocations in summations

I need to do several summations of specific matrix column(s). I can’t figure out a way to get rid of temporary allocations.

The code below is my best version, but it leaves a lot to be desired. If I don’t use @view, time and allocations are worse.

specialArray is the array I am after. Each element in it is the result of several summations.
I don’t think any array-size patterns should be exploited, since in the real code everything varies. I mention this because indexArray here always indexes sets of 3, but that is not always the case.

using LinearAlgebra: dot
using BenchmarkTools

################ Setup the problem #################
nCharge = 3000
nMol = 1000

array1 = ones(Float64, nMol,nMol) * 3
array2 = ones(Float64, nCharge,nCharge) * 0.5
array3 = ones(Float64, nCharge) * 0.25
specialArray = zeros(nMol)

indexArray = [0 for i=1:nMol, j=1:2]
iIndex = 1
jIndex = 3
for i=1:nMol
    global iIndex, jIndex
    indexArray[i,1] = iIndex
    indexArray[i,2] = jIndex

    iIndex += 3
    jIndex += 3
end

################ Define the problem #################
function SumThingsUp!(specialArray::Vector, array1::SubArray, array2::SubArray,
                      array3::SubArray, indexArray::SubArray)

    for i = 1:length(specialArray)
        
        st, ed = indexArray[i,1], indexArray[i,2]

        specialArray[i] = sum(sum,@view(array1[i,:])) + 
                          sum(sum,@view(array2[:,st:ed])) +
                          dot(@view(array3[st:ed]) , @view(array3[st:ed]) ) 
    end
    return specialArray
end

################ Run the problem #################
@btime SumThingsUp!($specialArray, $(@view(array1[:,:])), $(@view(array2[:,:])), $(@view(array3[:])), $(@view(indexArray[:,:])))

Each element in specialArray for this MWE should be 7500.1875. The fact that all indices follow the same pattern is not a reflection of the real problem, but an artifact of the MWE.

Currently @btime produces 15.488 ms (4000 allocations: 203.13 KiB)

Should I just make a function with for loops for doing the summations manually?

I have tried explicitly writing the summations in a function with for loops, and while it cuts down on the allocations a little, it makes the time worse by roughly the same proportion it saves in allocations…

function Sumat(array::SubArray)
    sumat = 0.0
    for i in eachindex(array)
        sumat += array[i]
    end
    return sumat
end

function Dotat(array)
    dotat = 0.0
    for i in eachindex(array)
        dotat += array[i] * array[i]
    end
    return dotat
end

function SumThingsUp!(specialArray::Vector{Float64}, array1::SubArray{Float64}, array2::SubArray{Float64},
                      array3::SubArray{Float64}, indexArray::SubArray{Int64})
    st = 0
    ed = 0
    for i = 1:length(specialArray)
        
        st, ed = indexArray[i,1], indexArray[i,2]

        specialArray[i] = Sumat( @view(array1[i,:]) ) + #sum(sum,@view(array1[i,:])) + 
                          Sumat( @view(array2[:,st:ed] ) ) + #sum(sum,@view(array2[:,st:ed])) +
                          Dotat( @view(array3[st:ed]) )    # dot(@view(array3[st:ed]) , @view(array3[st:ed]) ) 
    end
    return specialArray
end

@btime now produces 19.108 ms (3000 allocations: 156.25 KiB)

Edit:

Adding @inbounds to the for loops brings allocations down to 0, but the time is still high:
@btime gives 18.263 ms (0 allocations: 0 bytes), which is still a lot slower than the first method with its 4000 allocations.

When I run @profiler on the version that uses sum, all of the profile time is consumed by several different simd blocks.


You can add @simd to summation loops.
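For instance, a sketch of both macros applied (Sumat_simd is just an illustrative name, not part of the original code):

```julia
# Same accumulation as Sumat above, with @inbounds to skip bounds
# checks and @simd to allow the compiler to reassociate and
# vectorize the floating-point sum.
function Sumat_simd(array)
    s = 0.0
    @inbounds @simd for i in eachindex(array)
        s += array[i]
    end
    return s
end
```

This works on views too, e.g. Sumat_simd(@view(array3[st:ed])).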


Thanks, that gets it down to the same speed as the sum and dot method!

15.053 ms (0 allocations: 0 bytes)

Is there a way to do this without writing my own devectorized summation functions?

I think your Sumat is almost precisely sum, only 10x slower (because it lacks @inbounds):

julia> aa = view(rand(100,100),:,1:99);

julia> @btime sum($aa)
  935.621 ns (0 allocations: 0 bytes)
4951.355345450943

julia> @btime Sumat($aa)
  11.050 μs (0 allocations: 0 bytes)
4951.355345450939

There is no need for sum(sum,… that I can see, as you don’t have arrays of arrays. So I think you could just write this:

@inbounds @views for i = 1:length(specialArray)
    st, ed = indexArray[i,:]
    specialArray[i] = sum(array1[i,:]) + sum(array2[:,st:ed]) + dot(array3[st:ed], array3[st:ed]) 
end

or this

sum!(specialArray, array1)
@inbounds @views for i = 1:length(specialArray)
    st, ed = indexArray[i,:]
    specialArray[i] += sum(array2[:,st:ed]) + dot(array3[st:ed], array3[st:ed]) 
end

I presume the real indexArray is more complicated, too? Otherwise it might be better to work out st, ed directly.

Note aside that you can just write this:

indexArray = [ (1:3:3nMol-2) (3:3:3nMol) ]

Hi @improbable22 ,

Yes indexArray cannot be assumed to be in increments of 3 in the real code.

When I use only sum(), the result is both slightly slower than doing the summation manually as in Sumat, and it also generates 5000 allocations (1000 of which come from st, ed = indexArray[i,:]).

The devectorized for loops for the summations are sped up by both @inbounds and @simd. With these decorators the time is almost always less than 15 ms for my example, often around 14.5 ms. Using my original code or your modification, the results are almost always near 16 ms and always have 4000 allocations or more. This is not a big difference in speed; what bothered me was the allocations.

I did not realize I could use @views as a for-loop decorator… that really cleaned the code up!

There must be a way to use the built-in sum function and not have the allocations. They seem to add about a 7% increase in run time.

OK, maybe something is making Sumat work better in your example than in my toy timing. But I don’t see what you mean by devectorised, it’s just a different implementation of the same thing. (Almost — I think sum adds pairwise for accuracy reasons, sometimes.)

If you wanted to remove all allocations, then don’t make views and pass them around. Just add an inner loop over j with += array1[i,j], and I guess a 3rd loop for k=st:ed with += array2[j,k] inside a nested loop over j, plus += array3[k]^2 once per k. Probably you will want to keep adding to some accumulator variable which is written into specialArray[i] at the end.
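A sketch of what that fully loop-based version could look like (SumThingsUpLoops! is a hypothetical name; it assumes the same argument layout as the MWE above, with array3’s contribution being the dot product over st:ed):

```julia
# Zero-allocation version: plain arrays in, no views, one accumulator
# per output element. @simd on the innermost loops lets the compiler
# vectorize the reductions.
function SumThingsUpLoops!(specialArray, array1, array2, array3, indexArray)
    @inbounds for i in eachindex(specialArray)
        st, ed = indexArray[i, 1], indexArray[i, 2]
        acc = 0.0
        @simd for j in axes(array1, 2)       # sum of row i of array1
            acc += array1[i, j]
        end
        for k in st:ed
            @simd for j in axes(array2, 1)   # sum of columns st:ed of array2
                acc += array2[j, k]
            end
            acc += array3[k]^2               # dot(array3[st:ed], array3[st:ed])
        end
        specialArray[i] = acc
    end
    return specialArray
end
```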

Just make sure the 1ms improvement is going to add up to more than your time writing all this out… and decoding it again when you wonder what it’s all doing!


1 ms will save about 2-5 hours per simulation. This routine gets called millions of times. I was hoping there was a way to avoid explicit for loops, since I consider them ugly to look at :S That said, being an ex-Fortran guy, I consider them easy to follow and understand…

OK then you should absolutely just write the loops. And also look into @threads on the outermost if possible — which also tends to really like having zero allocations inside.
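A hypothetical threaded variant, sketched here with the earlier view-based body just to show the pattern (each i is independent, so the outer loop parallelizes cleanly; start Julia with e.g. julia -t 4 to get worker threads):

```julia
using LinearAlgebra: dot

# Parallelize the outer loop over output elements. The view-based body
# still allocates a little; for best scaling, replace it with the
# zero-allocation inner loops discussed above.
function SumThingsUpThreaded!(specialArray, array1, array2, array3, indexArray)
    Threads.@threads for i in eachindex(specialArray)
        @inbounds begin
            st, ed = indexArray[i, 1], indexArray[i, 2]
            @views specialArray[i] = sum(array1[i, :]) +
                                     sum(array2[:, st:ed]) +
                                     dot(array3[st:ed], array3[st:ed])
        end
    end
    return specialArray
end
```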
