Summing arrays efficiently

N = 1024 ^ 2;
A = [ones(N), ones(N), ones(N), ones(N)];

The naive way of summing the arrays in A would be to do sum(A). However, note the performance:

julia> @btime sum(A);
  7.968 ms (6 allocations: 24.00 MiB)

It unnecessarily allocates an array of size N for each addition (3x8 MB). Is there a way around this, or is the recommended approach to write a for loop, i.e.:

julia> @btime (S = copy(A[1]); for B in A[2:end]; S .+= B; end; S);
  4.818 ms (13 allocations: 8.00 MiB)

Try sum(sum.(A))

julia> @btime sum(sum.($A))
  1.876 ms (1 allocation: 112 bytes)
4.194304e6

Or, to go completely allocation free

julia> @btime sum(a->sum(a),$A)
  1.873 ms (0 allocations: 0 bytes)
4.194304e6
1 Like

Thanks! But the result I’m after here is an array which is the element wise sum of the 4 other arrays, and not the total sum of the individual elements.

If you don’t mind overwriting A then this would work:

A = [ones(2^10) for i=1:4];
sum(A)'
reduce((x,y) -> begin x .+= y; x end, A)'
A

using BenchmarkTools
@btime sum(A);
@btime reduce((x,y) -> begin x .+= y; x end, $A);

Edit: perhaps I should think more about foldl vs reduce here, but right now I think it’s OK.

Or similarily, but without overwriting A

N = 1024 ^ 2;
A = [ones(N), ones(N), ones(N), ones(N)];
@time r = zeros(size(A[1]));reduce((x,y)->(x .+= y),A, init=r);
 0.002367 seconds (7 allocations: 8.000 MiB)

This version looks a bit more pleasing

N = 1024 ^ 2;
A = [ones(N), ones(N), ones(N), ones(N)];
a ⊕ b = a .+= b
@time r = reduce(⊕, A, init=zeros(size(A[1])))
 0.009119 seconds (15 allocations: 8.001 MiB)

If the operator (⊕, \oplus) looks confusing, I found this interesting operator that might better correspond to what is done ⥆ \leftarrowplus :grin:

2 Likes

Thanks! That should work. I modified it to make it slightly more efficient:

julia> @btime reduce(⊕, view(A,2:length(A)), init=copy(A[1]));
  3.578 ms (8 allocations: 8.00 MiB)

Shouldn’t that be sum(sum, A)?

1 Like

Yes you are right!