Which is the most efficient way to add matrices/arrays in Julia 1.x?

Adding matrices is a very important operation in numerical computations, and I feel that a discussion of benchmarking results for this in Julia 1.x should take place here. Because I didn't find such a topic on Julia Discourse, I am posting what I found as a starting point. I apologize if I simply missed an existing discussion of this topic.

A few remarks. First, I was using Julia 1.3.1. Second, I didn't come up with this code myself; I just worked through the DiffEqTutorials.jl tutorial section Optimizing DiffEq Code and was unable to understand the results I got.

We are benchmarking functions:

test(A, B, C) = A + B + C

test2(A, B, C) = map((a, b, c) -> a + b + c, A, B, C)

function test3(A,B,C)
  D = similar(A)
  @inbounds for i in eachindex(A)
    D[i] = A[i] + B[i] + C[i]
  end
  D
end

test4(A, B, C) = A .+ B .+ C

test5(A, B, C) = @. A + B + C

test6!(D, A, B, C) = D .= A .+ B .+ C

test7!(D, A, B, C) = @. D = A + B + C

test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)

For this we create matrices

A = rand(1000,1000); B = rand(1000,1000); C = rand(1000,1000); D = zeros(1000,1000)

Results of benchmarking with @benchmark, run in a Jupyter notebook on a desktop computer with 32 GB of RAM.

Function Memory allocated Median time (ms)
test(...) 7.63 MiB 2.399
test2(...) 7.63 MiB around 2.5
test3(...) 7.63 MiB around 2.5
test4(...) 7.63 MiB 2.421
test5(...) 7.63 MiB around 2.5
test6!(...) 0 bytes around 2.5
test7!(...) 0 bytes around 2.5
test8!(...) 32 bytes 4.029

Results of benchmarking with @benchmark, run in a Jupyter notebook on a laptop with 4 GB of RAM.

Function Memory allocated Median time (ms)
test(...) 7.63 MiB 4.677
test2(...) 7.63 MiB 7.199
test3(...) 7.63 MiB 4.676
test4(...) 7.63 MiB 5.531
test5(...) 7.63 MiB 4.768
test6!(...) 0 bytes 4.682
test7!(...) 0 bytes 4.936
test8!(...) 32 bytes 11.588

Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on the desktop computer with 32 GB of RAM.

Function Memory allocated Median time (ms)
test(...) 7.63 MiB 2.521
test2(...) 7.63 MiB 2.529
test3(...) 7.63 MiB 2.503
test4(...) 7.63 MiB 2.517
test5(...) 7.63 MiB 2.529
test6!(...) 0 bytes 2.517
test7!(...) 0 bytes 2.529
test8!(...) 32 bytes 4.355

Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on the laptop with 4 GB of RAM.

Function Memory allocated Median time (ms)
test(...) 7.63 MiB 4.878
test2(...) 7.63 MiB 7.398
test3(...) 7.63 MiB 4.484
test4(...) 7.63 MiB 4.493
test5(...) 7.63 MiB 4.505
test6!(...) 0 bytes 4.470
test7!(...) 0 bytes 4.490
test8!(...) 32 bytes 10.579

I can't understand why method 1 is over two times faster than method 8, since I read somewhere that 8 should be faster because it doesn't allocate memory. I also don't understand why methods 6 and 7 use 0 bytes of memory while method 8 uses 32.

A few comments:
Method 8 uses an anonymous function (not the best), and it does allocate the running sum before writing it to D.
Methods 6 and 7 are non-allocating because you already allocated the output before calling the function (a bit of cheating compared to the other methods, such as 3, which include the allocation in the timing).
I don't think there is a good reason to use anything other than method 1, unless you have already allocated a suitable output array, in which case method 6 can simply reuse it (in-place).
Just in case, do interpolate the arguments with $ when using @benchmark, for example as in the sketch below.
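A minimal sketch of the difference, using the arrays and functions defined above (BenchmarkTools loaded):

@benchmark test(A, B, C)     # A, B, C are non-constant globals here, which adds overhead to the measurement

@benchmark test($A, $B, $C)  # $ interpolation splices in the values, so only the addition itself is timed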

2 Likes

If you are going to be doing this addition lots of times, it is likely that an in-place version will be faster. You would allocate the result array just once at the beginning. The actual addition operation is not going to be so different, as you saw.
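A minimal sketch of that pattern, assuming the addition is repeated inside some larger computation (the loop count here is made up for illustration):

A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)
D = similar(A)                   # allocate the result array once, up front

test7!(D, A, B, C) = @. D = A + B + C

for step in 1:100                # e.g. 100 steps of the surrounding computation
    test7!(D, A, B, C)           # writes into D in place, so no allocation per step
    # ... use D here ...
end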

1 Like

I wouldn’t recommend method 1, initially, because it should allocate intermediate temporary arrays. Method 4 should be better. I don’t know why the benchmarks are so similar, perhaps they’re not set up correctly.
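One way to check whether intermediate temporaries are actually created is to compare the allocation figures that BenchmarkTools reports. A minimal sketch, assuming the same arrays as above: if A + B + C were evaluated pairwise as (A + B) + C with a temporary, its report should show roughly twice the memory of the fused broadcast; if the two reports match, no intermediate array was created.

using BenchmarkTools

A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000);

@btime $A + $B + $C;      # compare the reported allocation size of this ...
@btime $A .+ $B .+ $C;    # ... against the fused broadcast, which allocates only the result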

2 Likes

Anonymous functions used to be slow, but they have had zero overhead since Julia v0.5. In fact, all of the broadcasting solutions are also using anonymous functions internally, so the presence of an anonymous function cannot be the reason test8 is slower.
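One way to confirm that is to benchmark a variant of test8! that uses a named function instead of the anonymous one. A minimal sketch (plus3 and test8_named! are names made up just for this comparison):

using BenchmarkTools

plus3(a, b, c) = a + b + c                                    # ordinary named function
test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)
test8_named!(D, A, B, C) = map!(plus3, D, A, B, C)

A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000); D = zeros(1000, 1000);

@btime test8!($D, $A, $B, $C);        # anonymous function, as in the original code
@btime test8_named!($D, $A, $B, $C);  # named function; matching timings mean the anonymous function is not the culprit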

As for why map! seems slow, I admit I can’t tell. I looked briefly at the code it calls in Base, and I don’t see anything obviously wrong.

5 Likes

I can’t reproduce this kind of huge difference. Are you remembering to interpolate the global arguments with $ into the benchmark expressions? Also, it’s more reliable to report the minimum time than the median or the mean, since the noise is all positive. @btime does the right thing:

julia> @btime test7!($D,$A,$B,$C);
  1.440 ms (0 allocations: 0 bytes)

julia> @btime test8!($D,$A,$B,$C);
  1.866 ms (1 allocation: 32 bytes)

Update: the above numbers were accidentally run with Julia 1.0.4. With 1.4 I get a big regression:

julia> @btime test7!($D,$A,$B,$C);
  1.455 ms (0 allocations: 0 bytes)

julia> @btime test8!($D,$A,$B,$C);
  2.926 ms (1 allocation: 32 bytes)

I reported an issue (50% performance regression in map! · Issue #35914 · JuliaLang/julia · GitHub).

6 Likes

I was using the code from Optimizing DiffEq Code, where there is no $. All the relevant code from that page is below.

@benchmark test(A,B,C)

@benchmark test2(A,B,C)

@benchmark test3(A,B,C)

@benchmark test4(A,B,C)

@benchmark test5(A,B,C)

@benchmark test6!(D,A,B,C)

@benchmark test7!(D,A,B,C)

@benchmark test8!(D,A,B,C)

@benchmark A*B

@benchmark mul!(D,A,B)  # mul! requires `using LinearAlgebra`

If this is the wrong way to do it, I will open an issue about it on the tutorial's GitHub. (My GitHub is a mess and I can't make a pull request without finally learning git.) And maybe the other benchmarks are also wrong?

Thank you for the information that we should take the minimum time, and especially for that article; I must read it.

Following @Nosferican's and @stevengj's advice, I repeated the benchmarks, always calling with $ interpolation of the global arguments,

@benchmark fun($A,$B,$C)

and reporting the minimum time rather than the median time. Again, I used Julia 1.3.1.

Results are only for my 4 GB RAM laptop, because due to SARS-CoV-2 I am cut off from my desktop computer.

Below is all the code I used, in case someone finds a mistake in it.

using BenchmarkTools

A = rand(1000,1000); B = rand(1000,1000); C = rand(1000,1000); D = zeros(1000,1000);


test(A,B,C) = A + B + C

test2(A,B,C) = map((a,b,c)->a+b+c,A,B,C)

function test3(A,B,C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end

test4(A,B,C) = A .+ B .+ C

test5(A,B,C) = @. A + B + C

test6!(D,A,B,C) = D .= A .+ B .+ C

test7!(D,A,B,C) = @. D = A + B + C

test8!(D,A,B,C) = map!((a,b,c)->a+b+c,D,A,B,C)


@benchmark test($A,$B,$C)

@benchmark test2($A,$B,$C)

@benchmark test3($A,$B,$C)

@benchmark test4($A,$B,$C)

@benchmark test5($A,$B,$C)

@benchmark test6!($D,$A,$B,$C)

@benchmark test7!($D,$A,$B,$C)

@benchmark test8!($D,$A,$B,$C)

Jupyter Notebook results.

Function Memory allocated Minimum time (ms)
test(...) 7.63 MiB 4.463
test2(...) 7.63 MiB 6.867
test3(...) 7.63 MiB 4.412
test4(...) 7.63 MiB 4.418
test5(...) 7.63 MiB 4.469
test6!(...) 0 bytes 4.430
test7!(...) 0 bytes 4.425
test8!(...) 32 bytes 10.395

REPL results.

Function Memory allocated Minimum time (ms)
test(...) 7.63 MiB 4.479
test2(...) 7.63 MiB 6.939
test3(...) 7.63 MiB 4.485
test4(...) 7.63 MiB 4.497
test5(...) 7.63 MiB 4.491
test6!(...) 0 bytes 4.451
test7!(...) 0 bytes 4.477
test8!(...) 32 bytes 10.403

I repeated the benchmarks for adding 1000 x 1000 matrices generated by rand(1000, 1000), this time in Julia 1.5.3. Results were produced with @benchmark run in a Jupyter notebook on the 4 GB RAM laptop, and following Steven G. Johnson's advice I now report the minimum time as the main measure. The median time is included to make comparison with the previous results easier.

The original code was taken from the DiffEqTutorials.jl tutorial.

Function Memory allocated Minimum time (ms) Median time (ms)
test($A, $B, $C) 7.63 MiB around 3.5 around 3.6
test2($A, $B, $C) 7.63 MiB around 3.5 around 3.6
test3($A, $B, $C) 7.63 MiB around 3.5 around 3.6
test4($A, $B, $C) 7.63 MiB around 3.5 around 3.6
test5($A, $B, $C) 7.63 MiB around 3.5 around 3.8
test6!($D, $A, $B, $C) 0 bytes around 3.5 around 3.6
test7!($D, $A, $B, $C) 0 bytes around 3.5 around 3.6
test8!($D, $A, $B, $C) 0 bytes around 6.1 around 6.3

We can easily see that Julia 1.5 is over 20% faster than 1.4 (results vary from test to test) and that there is almost no difference in performance among most of the methods, so one can just write A + B + C, which is great!

Only map! gives you a roughly 74% time penalty, but compared to the benefits this is not a big problem. I hope that Julia 1.6 will solve this problem, but I don't dare work with the master branch, not now.

Relevant code from the page.

A = rand(1000,1000); B = rand(1000,1000); C = rand(1000,1000); D = zeros(1000,1000);

test(A, B, C) = A + B + C

test2(A, B, C) = map((a, b, c) -> a + b + c, A, B, C)

function test3(A, B, C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end

test4(A, B, C) = A .+ B .+ C

test5(A, B, C) = @. A + B + C

test6!(D, A, B, C) = D .= A .+ B .+ C

test7!(D, A, B, C) = @. D = A + B + C

test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)

1 Like

You don’t need to work with the master branch. There is already a release candidate of 1.6 that will probably become the release.

2 Likes

I followed @dpsanders's advice and repeated the tests in Julia 1.6.0-rc3. Since I have no experience with having two Julia versions on the same machine, and strange things just happen on computers, I decided to be cautious and ran these benchmarks in the REPL on the 4 GB RAM laptop.

Function Memory allocated Minimum time (ms)
test($A, $B, $C) 7.63 MiB around 3.575
test2($A, $B, $C) 7.63 MiB around 3.6
test3($A, $B, $C) 7.63 MiB around 3.6
test4($A, $B, $C) 7.63 MiB around 3.6
test5($A, $B, $C) 7.63 MiB around 3.65
test6!($D, $A, $B, $C) 0 bytes around 3.6
test7!($D, $A, $B, $C) 0 bytes around 3.7
test8!($D, $A, $B, $C) 0 bytes around 6.5

It looks like there is still a penalty for using map!, but hopefully this problem will be resolved soon. I posted about it in the GitHub issue.

1 Like

Since the official Julia 1.6 is out, I repeated this benchmark in a Jupyter notebook on my 4 GB RAM laptop.

Function Memory allocated Minimum time (ms)
test($A, $B, $C) 7.63 MiB 3.581
test2($A, $B, $C) 7.63 MiB 3.586
test3($A, $B, $C) 7.63 MiB 3.573
test4($A, $B, $C) 7.63 MiB 3.552
test5($A, $B, $C) 7.63 MiB 3.602
test6!($D, $A, $B, $C) 0 bytes 3.582
test7!($D, $A, $B, $C) 0 bytes 3.604
test8!($D, $A, $B, $C) 0 bytes 6.363

It looks like there is still a problem with map!, but I can live with that, since A + B + C is very fast.

Information about my settings.

Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)

3 Likes