Which is the most efficient way to add matrices/arrays in Julia 1.x?

Adding matrices is a very important operation in numerical computing, and I feel that a discussion of benchmark results for it in Julia 1.x should take place here. Because I couldn’t find such a topic on Julia Discourse, I’m posting what I found as a starting point. I apologize if I simply missed an existing discussion.

A few remarks. First, I was using Julia 1.3.1. Second, I didn’t come up with this code myself; I was working through the Optimizing DiffEq Code section of the DiffEqTutorials.jl tutorial and was unable to understand the results I got.

We are benchmarking functions:

```julia
test(A, B, C) = A + B + C

test2(A, B, C) = map((a, b, c) -> a + b + c, A, B, C)

function test3(A, B, C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end

test4(A, B, C) = A .+ B .+ C

test5(A, B, C) = @. A + B + C

test6!(D, A, B, C) = D .= A .+ B .+ C

test7!(D, A, B, C) = @. D = A + B + C

test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)
```

For this we create the matrices

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000); D = zeros(1000, 1000)
```
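As a quick sanity check (my own addition, not part of the tutorial), all the variants compute the same elementwise sum, and the in-place versions write it into D:

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000); D = zeros(1000, 1000)

test(A, B, C) = A + B + C                # out-of-place
test4(A, B, C) = A .+ B .+ C             # fused broadcast, out-of-place
test7!(D, A, B, C) = @. D = A + B + C    # fused broadcast, in-place

expected = test(A, B, C)
@assert test4(A, B, C) == expected       # same elementwise result
test7!(D, A, B, C)
@assert D == expected                    # the in-place version filled D
```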

Results of benchmarking with @benchmark, run in a Jupyter Notebook on a stationary computer with 32 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 2.399 |
| test2(...) | 7.63 MiB | around 2.5 |
| test3(...) | 7.63 MiB | around 2.5 |
| test4(...) | 7.63 MiB | 2.421 |
| test5(...) | 7.63 MiB | around 2.5 |
| test6!(...) | 0 bytes | around 2.5 |
| test7!(...) | 0 bytes | around 2.5 |
| test8!(...) | 32 bytes | 4.029 |

Results of benchmarking with @benchmark, run in a Jupyter Notebook on a laptop with 4 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.677 |
| test2(...) | 7.63 MiB | 7.199 |
| test3(...) | 7.63 MiB | 4.676 |
| test4(...) | 7.63 MiB | 5.531 |
| test5(...) | 7.63 MiB | 4.768 |
| test6!(...) | 0 bytes | 4.682 |
| test7!(...) | 0 bytes | 4.936 |
| test8!(...) | 32 bytes | 11.588 |

Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on a stationary computer with 32 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 2.521 |
| test2(...) | 7.63 MiB | 2.529 |
| test3(...) | 7.63 MiB | 2.503 |
| test4(...) | 7.63 MiB | 2.517 |
| test5(...) | 7.63 MiB | 2.529 |
| test6!(...) | 0 bytes | 2.517 |
| test7!(...) | 0 bytes | 2.529 |
| test8!(...) | 32 bytes | 4.355 |

Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on a laptop with 4 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.878 |
| test2(...) | 7.63 MiB | 7.398 |
| test3(...) | 7.63 MiB | 4.484 |
| test4(...) | 7.63 MiB | 4.493 |
| test5(...) | 7.63 MiB | 4.505 |
| test6!(...) | 0 bytes | 4.470 |
| test7!(...) | 0 bytes | 4.490 |
| test8!(...) | 32 bytes | 10.579 |

I can’t understand why method 1 is over two times faster than method 8, since I read somewhere that 8 should be faster because it doesn’t allocate memory. I also don’t understand why methods 6 and 7 use 0 bytes of memory while 8 uses 32.

A few comments:

- Method 8 uses an anonymous function (not the best), and it does allocate the running sum before writing it to D.
- Methods 6 and 7 are non-allocating because you already allocated D before calling the function (a bit of cheating compared to the other methods, such as 3, which include the preallocation in the timing).
- I don’t think there is a good reason to use anything other than method 1, unless you have already allocated a suitable output array, in which case method 6 can simply reuse it (in-place).
- Just in case, do interpolate the arguments with $ for @benchmark.


If you are going to be doing this addition many times, an in-place version will likely be faster overall: you allocate the result array just once at the beginning. The actual addition operation itself is not going to be much different, as you saw.
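A minimal sketch of that pattern (the loop body and iteration count here are hypothetical, just to illustrate reusing one buffer):

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)
D = similar(A)            # allocate the output once, up front
for k in 1:100            # hypothetical number of repetitions
    @. D = A + B + C      # fused, in-place: no per-iteration allocation
    # ... use D here, e.g. as input to the next step ...
end
```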


I wouldn’t recommend method 1, a priori, because I’d expect it to allocate intermediate temporary arrays. Method 4 should be better. I don’t know why the benchmarks are so similar; perhaps they’re not set up correctly.
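One way to check that expectation is with @allocated. (On my reading of Base, `A + B + C` parses as the single call `+(A, B, C)`, and for arrays that method broadcasts over all three arguments in one pass, which would explain why the tables above show the same 7.63 MiB for methods 1 and 4. Exact byte counts may vary by Julia version.)

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)

plainsum(A, B, C) = A + B + C     # chained +, parses as +(A, B, C)
fusedsum(A, B, C) = A .+ B .+ C   # explicitly fused broadcast

plainsum(A, B, C); fusedsum(A, B, C)       # warm up so @allocated excludes compilation
bytes_plain = @allocated plainsum(A, B, C)
bytes_fused = @allocated fusedsum(A, B, C)

# One 1000x1000 Float64 array is 8_000_000 bytes ≈ 7.63 MiB; if plainsum built
# an extra temporary, bytes_plain would be roughly twice bytes_fused.
println((bytes_plain, bytes_fused))
```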


Anonymous functions used to be slow, but they have had zero overhead since Julia v0.5. In fact, all of the broadcasting solutions are also using anonymous functions internally, so the presence of an anonymous function cannot be the reason test8 is slower.

As for why map! seems slow, I admit I can’t tell. I looked briefly at the code it calls in Base, and I don’t see anything obviously wrong.


I can’t reproduce this kind of huge difference. Are you remembering to interpolate the global arguments with $ into the benchmark expressions? Also, it’s more reliable to report the minimum time than the median or the mean, since the noise is all positive. @btime does the right thing:

```julia
julia> @btime test7!($D,$A,$B,$C);
  1.440 ms (0 allocations: 0 bytes)

julia> @btime test8!($D,$A,$B,$C);
  1.866 ms (1 allocation: 32 bytes)
```

Update: the above numbers were accidentally run with Julia 1.0.4. With 1.4 I get a big regression:

```julia
julia> @btime test7!($D,$A,$B,$C);
  1.455 ms (0 allocations: 0 bytes)

julia> @btime test8!($D,$A,$B,$C);
  2.926 ms (1 allocation: 32 bytes)
```

I reported an issue (https://github.com/JuliaLang/julia/issues/35914).


I was using the code from Optimizing DiffEq Code, where there is no $. All the relevant code from that page is below.

```julia
@benchmark test(A,B,C)
@benchmark test2(A,B,C)
@benchmark test3(A,B,C)
@benchmark test4(A,B,C)
@benchmark test5(A,B,C)
@benchmark test6!(D,A,B,C)
@benchmark test7!(D,A,B,C)
@benchmark test8!(D,A,B,C)
@benchmark A*B
@benchmark mul!(D,A,B)
```

If this is the wrong way to do it, I will open an issue about it on the tutorial’s GitHub. (My GitHub is a mess and I can’t make a pull request without finally learning git.) And maybe the other benchmarks are also wrong?

Thank you for the information that we should take the minimum time, and especially for this article. I must read it.

Following @Nosferican’s and @stevengj’s advice, I repeated the benchmarking, always calling with $ interpolation of the global arguments,

```julia
@benchmark fun($A,$B,$C)
```

and reporting the minimum time rather than the median. Again, I used Julia 1.3.1.

The results are only for my 4 GB RAM laptop, because due to SARS-CoV-2 I’m cut off from my stationary computer.

Below is all the code that I used, so maybe someone can find a mistake in it.

```julia
using BenchmarkTools

A = rand(1000,1000); B = rand(1000,1000); C = rand(1000,1000); D = zeros(1000,1000);

test(A,B,C) = A + B + C

test2(A,B,C) = map((a,b,c)->a+b+c,A,B,C)

function test3(A,B,C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end

test4(A,B,C) = A .+ B .+ C

test5(A,B,C) = @. A + B + C

test6!(D,A,B,C) = D .= A .+ B .+ C

test7!(D,A,B,C) = @. D = A + B + C

test8!(D,A,B,C) = map!((a,b,c)->a+b+c,D,A,B,C)

@benchmark test($A,$B,$C)
@benchmark test2($A,$B,$C)
@benchmark test3($A,$B,$C)
@benchmark test4($A,$B,$C)
@benchmark test5($A,$B,$C)
@benchmark test6!($D,$A,$B,$C)
@benchmark test7!($D,$A,$B,$C)
@benchmark test8!($D,$A,$B,$C)
```

Jupyter Notebook results.

| Function | Memory allocated | Minimum time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.463 |
| test2(...) | 7.63 MiB | 6.867 |
| test3(...) | 7.63 MiB | 4.412 |
| test4(...) | 7.63 MiB | 4.418 |
| test5(...) | 7.63 MiB | 4.469 |
| test6!(...) | 0 bytes | 4.430 |
| test7!(...) | 0 bytes | 4.425 |
| test8!(...) | 32 bytes | 10.395 |

REPL results.

| Function | Memory allocated | Minimum time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.479 |
| test2(...) | 7.63 MiB | 6.939 |
| test3(...) | 7.63 MiB | 4.485 |
| test4(...) | 7.63 MiB | 4.497 |
| test5(...) | 7.63 MiB | 4.491 |
| test6!(...) | 0 bytes | 4.451 |
| test7!(...) | 0 bytes | 4.477 |
| test8!(...) | 32 bytes | 10.403 |