Summing a vector of vectors in a non-allocating way

What’s going wrong here, I get a DimensionMismatch error when I run this:

input = repeat([ones(2)], 4)
output = zeros(2)
sum(input) # Gives a 2-element Vector{Float64}
sum!(output, input) # Gives a DimensionMismatch error

The error is:

DimensionMismatch("reduction on array with indices (Base.OneTo(4),) with output with indices (Base.OneTo(2),)")

I can write a fix function for my specific case:

function sum_fixed!(output, input::Vector{Vector{T}}, n, m) where T
    for j in 1:m
        for i in 1:n
            output[j] += input[i][j]
        end
    end
end

which is (sometimes) non-allocating:

@btime sum_fixed!(output, input, 4, 2)
34.747 ns (0 allocations: 0 bytes)

but I’m still curious as to why the first doesn’t work and why my function sometimes allocates memory as in the case of my function in the context that I want to use it:

using BenchmarkTools

nagents = 4
f_αβ = reshape(repeat([[0.0, 0.0]], nagents * nagents), nagents, nagents)
sum_f_αβ = [0.0, 0.0]
F_α0 = repeat([[0.0, 0.0]], nagents)

@btime begin
    for i = 1:$nagents
        sum_fixed!($sum_f_αβ, $f_αβ[:, i], $nagents, $2) # 4 (nagents) allocations
        @. $F_α[i] = $F_α0[i] + $sum_f_αβ # 0 allocations
    end
end

which then gives

297.729 ns (4 allocations: 448 bytes)

Can anyone tell me why there are 4 allocations in that loop?

For the first question, the sum! doc suggests that input should be transformed first:

sum!(output, hcat(input...))

I would not do that (since I learned that last week :smile:) because hcat(input...) also allocates quite a bit

julia> using BenchmarkTools

julia> input = repeat([ones(2)], 100000); # make it larger

julia> output = zeros(2);

julia> @btime sum!($output, hcat($input...))
  1.884 ms (7 allocations: 3.05 MiB)
2-element Vector{Float64}:
 100000.0
 100000.0

julia> @btime sum!($output, reduce(hcat, input))
  801.142 μs (2 allocations: 1.53 MiB)
2-element Vector{Float64}:
 100000.0
 100000.0

Certainly the different approach would be to use a matrix instead of a Vector{Vector{Float32}}
Is that possible?

Probably like this

julia> input = repeat([1.0, 2.0], outer=(1,2))
2×2 Matrix{Float64}:
 1.0  1.0
 2.0  2.0

julia> output = sum(input, dims=2) # to see the shape, it is 2x1!
2×1 Matrix{Float64}:
 2.0
 4.0

julia> @btime sum!($output, $input)
  14.685 ns (0 allocations: 0 bytes)
2×1 Matrix{Float64}:
 2.0
 4.0

I don’t know why sum! doesn’t work in this case.
Here is an allocation-free solution without manually implementing the sum:

using SplitApplyCombine
sum!(output, combinedimsview(input))

It has a somewhat worse perfromance than your sum_fixed! though, about a factor of 1.5 for me.

In any case, be aware of this:

julia> x = repeat([[0,0]],3)
3-element Vector{Vector{Int64}}:
 [0, 0]
 [0, 0]
 [0, 0]

julia> x[1][1] = 1
1

julia> x
3-element Vector{Vector{Int64}}:
 [1, 0]
 [1, 0]
 [1, 0]


use a comprehension probably:

julia> x = [ [0,0] for _ in 1:3 ]
3-element Vector{Vector{Int64}}:
 [0, 0]
 [0, 0]
 [0, 0]

julia> x[1][1] = 1
1

julia> x
3-element Vector{Vector{Int64}}:
 [1, 0]
 [0, 0]
 [0, 0]


4 Likes

Oh my goodness I had no idea that this was the behaviour…

I’ve been looking for the bug in my code for days and it was caused by this all along.

2 Likes

Using an array of Static Vectors as input, a performant solution could also be obtained:

using StaticArrays

sum_reinterpret!(output, input, n, m) = sum!(output, reshape(reinterpret(Float64,input), (n, m)))

n = 2; m = 1_000
input = [ones(SVector{n, Float64}) for _ in 1:m]
output = zeros(n)
 
using BenchmarkTools
@btime sum_reinterpret!($output, $input, $n, $m)  # 2.333 μs (0 allocations: 0 bytes)

# which compares to the loops on standard array of arrays input:
@btime sum_fixed!($output, $input, 1_000, 2)      # 2.089 μs (0 allocations: 0 bytes)

Can anyone tell me why there are 4 allocations in that loop?

To partially answer my own question, it’s because doing vec[i][j] is an allocation from j.

vec[i] doesn’t allocate, but accessing the jth element sometimes does.

f_αβ[:, i] also seems to allocate because it constructs a vector when called.

Not entirely allocation free but perhaps still useful:

using RecursiveArrayTools
sum!(output, VectorOfArray(input))

Just need to modify your code to use a view at this point:
sum_fixed!(sum_f_αβ, view(f_αβ, :, i), nagents, 2)

and maybe relax the type definition in sum_fixed!()

Then 0 allocations are obtained.
function sum_fixed!(output, input, n, m)
    for j in 1:m, i in 1:n
        output[j] += input[i][j]
    end
end

function jac!(sum_f_αβ, F_α, F_α0, f_αβ, nagents)
    for i in 1:nagents
        sum_fixed!(sum_f_αβ, view(f_αβ, :, i), nagents, 2)
        @. F_α[i] = F_α0[i] + sum_f_αβ
    end
end

nagents = 4
f_αβ = [[0.0, 0.0] for _ in 1:nagents, _ in 1:nagents]
sum_f_αβ = [0.0, 0.0]
F_α0 = [[0.0, 0.0] for _ in 1:nagents ]
F_α = copy(F_α0)


using BenchmarkTools
@benchmark jac!(sum_f_αβ, F_α, F_α0, f_αβ, nagents)

       @benchmark jac!(sum_f_αβ, F_α, F_α0, f_αβ, nagents)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min … max):   86.102 ns … 401.985 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      90.073 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   100.627 ns ±  26.710 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▁▂▇▃▂▃▂▁▂       ▂▁▁▁                                       ▁
  █████████████▇▇▆▇▇███████▆▇▆▆▆▇▆▆▆▆▅▅▆▇▇▇▇▅▇▆▆▇▇▆▆▆▅▆▆▆▅▆▆▅▄▅ █
  86.1 ns       Histogram: log(frequency) by time        213 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
1 Like

Thanks, I tried views but forgot that running my code outside a function would lead to spurious allocations!

Thanks to everyone for answering :slight_smile: