Summing a vector of vectors in a non-allocating way

What’s going wrong here, I get a DimensionMismatch error when I run this:

input = repeat([ones(2)], 4)
output = zeros(2)
sum(input) # Gives a 2-element Vector{Float64}
sum!(output, input) # Gives a DimensionMismatch error

The error is:

DimensionMismatch("reduction on array with indices (Base.OneTo(4),) with output with indices (Base.OneTo(2),)")

I can write a fix function for my specific case:

function sum_fixed!(output, input::Vector{Vector{T}}, n, m) where T
    for j in 1:m
        for i in 1:n
            output[j] += input[i][j]
        end
    end
end

which is (sometimes) non-allocating:

@btime sum_fixed!(output, input, 4, 2)
34.747 ns (0 allocations: 0 bytes)

but I’m still curious as to why the first doesn’t work and why my function sometimes allocates memory as in the case of my function in the context that I want to use it:

using BenchmarkTools

nagents = 4
f_αβ = reshape(repeat([[0.0, 0.0]], nagents * nagents), nagents, nagents)
sum_f_αβ = [0.0, 0.0]
F_α0 = repeat([[0.0, 0.0]], nagents)

@btime begin
    for i = 1:$nagents
        sum_fixed!($sum_f_αβ, $f_αβ[:, i], $nagents, $2) # 4 (nagents) allocations
        @. $F_α[i] = $F_α0[i] + $sum_f_αβ # 0 allocations
    end
end

which then gives

297.729 ns (4 allocations: 448 bytes)

Can anyone tell me why there are 4 allocations in that loop?

For the first question, the sum! doc suggests that input should be transformed first:

sum!(output, hcat(input...))

I would not do that (since I learned that last week :smile:) because hcat(input...) also allocates quite a bit

julia> using BenchmarkTools

julia> input = repeat([ones(2)], 100000); # make it larger

julia> output = zeros(2);

julia> @btime sum!($output, hcat($input...))
  1.884 ms (7 allocations: 3.05 MiB)
2-element Vector{Float64}:
 100000.0
 100000.0

julia> @btime sum!($output, reduce(hcat, input))
  801.142 μs (2 allocations: 1.53 MiB)
2-element Vector{Float64}:
 100000.0
 100000.0

Certainly the different approach would be to use a matrix instead of a Vector{Vector{Float32}}
Is that possible?

Probably like this

julia> input = repeat([1.0, 2.0], outer=(1,2))
2×2 Matrix{Float64}:
 1.0  1.0
 2.0  2.0

julia> output = sum(input, dims=2) # to see the shape, it is 2x1!
2×1 Matrix{Float64}:
 2.0
 4.0

julia> @btime sum!($output, $input)
  14.685 ns (0 allocations: 0 bytes)
2×1 Matrix{Float64}:
 2.0
 4.0

I don’t know why sum! doesn’t work in this case.
Here is an allocation-free solution without manually implementing the sum:

using SplitApplyCombine
sum!(output, combinedimsview(input))

It has a somewhat worse perfromance than your sum_fixed! though, about a factor of 1.5 for me.

In any case, be aware of this:

julia> x = repeat([[0,0]],3)
3-element Vector{Vector{Int64}}:
 [0, 0]
 [0, 0]
 [0, 0]

julia> x[1][1] = 1
1

julia> x
3-element Vector{Vector{Int64}}:
 [1, 0]
 [1, 0]
 [1, 0]


use a comprehension probably:

julia> x = [ [0,0] for _ in 1:3 ]
3-element Vector{Vector{Int64}}:
 [0, 0]
 [0, 0]
 [0, 0]

julia> x[1][1] = 1
1

julia> x
3-element Vector{Vector{Int64}}:
 [1, 0]
 [0, 0]
 [0, 0]


Oh my goodness I had no idea that this was the behaviour…

I’ve been looking for the bug in my code for days and it was caused by this all along.

Using an array of Static Vectors as input, a performant solution could also be obtained:

using StaticArrays

sum_reinterpret!(output, input, n, m) = sum!(output, reshape(reinterpret(Float64,input), (n, m)))

n = 2; m = 1_000
input = [ones(SVector{n, Float64}) for _ in 1:m]
output = zeros(n)
 
using BenchmarkTools
@btime sum_reinterpret!($output, $input, $n, $m)  # 2.333 μs (0 allocations: 0 bytes)

# which compares to the loops on standard array of arrays input:
@btime sum_fixed!($output, $input, 1_000, 2)      # 2.089 μs (0 allocations: 0 bytes)

Can anyone tell me why there are 4 allocations in that loop?

To partially answer my own question, it’s because doing vec[i][j] is an allocation from j.

vec[i] doesn’t allocate, but accessing the jth element sometimes does.

f_αβ[:, i] also seems to allocate because it constructs a vector when called.

Not entirely allocation free but perhaps still useful:

using RecursiveArrayTools
sum!(output, VectorOfArray(input))

Just need to modify your code to use a view at this point:
sum_fixed!(sum_f_αβ, view(f_αβ, :, i), nagents, 2)

and maybe relax the type definition in sum_fixed!()

Then 0 allocations are obtained.
function sum_fixed!(output, input, n, m)
    for j in 1:m, i in 1:n
        output[j] += input[i][j]
    end
end

function jac!(sum_f_αβ, F_α, F_α0, f_αβ, nagents)
    for i in 1:nagents
        sum_fixed!(sum_f_αβ, view(f_αβ, :, i), nagents, 2)
        @. F_α[i] = F_α0[i] + sum_f_αβ
    end
end

nagents = 4
f_αβ = [[0.0, 0.0] for _ in 1:nagents, _ in 1:nagents]
sum_f_αβ = [0.0, 0.0]
F_α0 = [[0.0, 0.0] for _ in 1:nagents ]
F_α = copy(F_α0)


using BenchmarkTools
@benchmark jac!(sum_f_αβ, F_α, F_α0, f_αβ, nagents)

       @benchmark jac!(sum_f_αβ, F_α, F_α0, f_αβ, nagents)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min … max):   86.102 ns … 401.985 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      90.073 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   100.627 ns ±  26.710 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▁▂▇▃▂▃▂▁▂       ▂▁▁▁                                       ▁
  █████████████▇▇▆▇▇███████▆▇▆▆▆▇▆▆▆▆▅▅▆▇▇▇▇▅▇▆▆▇▇▆▆▆▅▆▆▆▅▆▆▅▄▅ █
  86.1 ns       Histogram: log(frequency) by time        213 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks, I tried views but forgot that running my code outside a function would lead to spurious allocations!

Thanks to everyone for answering :slight_smile: