Which is the most efficient way to add matrices/arrays in Julia 1.x?

Adding matrices is a very important operation in numerical computing, and I feel that a discussion of benchmark results for it in Julia 1.x should take place here. Because I couldn’t find such a topic on Julia Discourse, I’m posting what I found as a starting point. I apologize if I simply missed an existing discussion.

A few remarks. First, I was using Julia 1.3.1. Second, I didn’t come up with this code myself; I was working through the Optimizing DiffEq Code section of the DiffEqTutorials.jl tutorial and was unable to understand the results I got.

We are benchmarking functions:

```julia
test(A, B, C) = A + B + C

test2(A, B, C) = map((a, b, c) -> a + b + c, A, B, C)

function test3(A, B, C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end

test4(A, B, C) = A .+ B .+ C

test5(A, B, C) = @. A + B + C

test6!(D, A, B, C) = D .= A .+ B .+ C

test7!(D, A, B, C) = @. D = A + B + C

test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)
```

For this we create the matrices

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000); D = zeros(1000, 1000)
```
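As a quick sanity check (my own addition, not part of the tutorial), all the variants compute the same elementwise sum, and the in-place versions write it into D:

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000); D = zeros(1000, 1000)

test(A, B, C) = A + B + C                # out-of-place
test4(A, B, C) = A .+ B .+ C             # fused broadcast, out-of-place
test7!(D, A, B, C) = @. D = A + B + C    # fused broadcast, in-place

expected = test(A, B, C)
@assert test4(A, B, C) == expected       # same elementwise result
test7!(D, A, B, C)
@assert D == expected                    # the in-place version filled D
```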

Results of benchmarking with @benchmark, run in a Jupyter Notebook on a stationary computer with 32 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 2.399 |
| test2(...) | 7.63 MiB | around 2.5 |
| test3(...) | 7.63 MiB | around 2.5 |
| test4(...) | 7.63 MiB | 2.421 |
| test5(...) | 7.63 MiB | around 2.5 |
| test6!(...) | 0 bytes | around 2.5 |
| test7!(...) | 0 bytes | around 2.5 |
| test8!(...) | 32 bytes | 4.029 |

Results of benchmarking with @benchmark, run in a Jupyter Notebook on a laptop with 4 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.677 |
| test2(...) | 7.63 MiB | 7.199 |
| test3(...) | 7.63 MiB | 4.676 |
| test4(...) | 7.63 MiB | 5.531 |
| test5(...) | 7.63 MiB | 4.768 |
| test6!(...) | 0 bytes | 4.682 |
| test7!(...) | 0 bytes | 4.936 |
| test8!(...) | 32 bytes | 11.588 |

Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on a stationary computer with 32 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 2.521 |
| test2(...) | 7.63 MiB | 2.529 |
| test3(...) | 7.63 MiB | 2.503 |
| test4(...) | 7.63 MiB | 2.517 |
| test5(...) | 7.63 MiB | 2.529 |
| test6!(...) | 0 bytes | 2.517 |
| test7!(...) | 0 bytes | 2.529 |
| test8!(...) | 32 bytes | 4.355 |

Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on a laptop with 4 GB RAM.

| Function | Memory allocated | Median time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.878 |
| test2(...) | 7.63 MiB | 7.398 |
| test3(...) | 7.63 MiB | 4.484 |
| test4(...) | 7.63 MiB | 4.493 |
| test5(...) | 7.63 MiB | 4.505 |
| test6!(...) | 0 bytes | 4.470 |
| test7!(...) | 0 bytes | 4.490 |
| test8!(...) | 32 bytes | 10.579 |

I can’t understand why method 1 is over two times faster than method 8, since I read somewhere that 8 should be faster because it doesn’t allocate memory. I also don’t understand why methods 6 and 7 use 0 bytes of memory while 8 uses 32.

A few comments:

- Method 8 uses an anonymous function (not the best), and it does allocate the running sum before writing it to D.
- Methods 6 and 7 are non-allocating because you already allocated D before calling the function (a bit of cheating compared to the other methods, such as 3, which include the preallocation in the timing).
- I don’t think there is a good reason to use anything other than method 1, unless you have already allocated a suitable output array, in which case method 6 can simply reuse it (in-place).
- Just in case, do interpolate the arguments with $ for @benchmark.


If you are going to be doing this addition many times, an in-place version will likely be faster overall: you allocate the result array just once at the beginning. The actual addition operation itself is not going to be much different, as you saw.
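A minimal sketch of that pattern (the loop body and iteration count here are hypothetical, just to illustrate reusing one buffer):

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)
D = similar(A)            # allocate the output once, up front
for k in 1:100            # hypothetical number of repetitions
    @. D = A + B + C      # fused, in-place: no per-iteration allocation
    # ... use D here, e.g. as input to the next step ...
end
```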


I wouldn’t recommend method 1, a priori, because I’d expect it to allocate intermediate temporary arrays. Method 4 should be better. I don’t know why the benchmarks are so similar; perhaps they’re not set up correctly.
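One way to check that expectation is with @allocated. (On my reading of Base, `A + B + C` parses as the single call `+(A, B, C)`, and for arrays that method broadcasts over all three arguments in one pass, which would explain why the tables above show the same 7.63 MiB for methods 1 and 4. Exact byte counts may vary by Julia version.)

```julia
A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)

plainsum(A, B, C) = A + B + C     # chained +, parses as +(A, B, C)
fusedsum(A, B, C) = A .+ B .+ C   # explicitly fused broadcast

plainsum(A, B, C); fusedsum(A, B, C)       # warm up so @allocated excludes compilation
bytes_plain = @allocated plainsum(A, B, C)
bytes_fused = @allocated fusedsum(A, B, C)

# One 1000x1000 Float64 array is 8_000_000 bytes ≈ 7.63 MiB; if plainsum built
# an extra temporary, bytes_plain would be roughly twice bytes_fused.
println((bytes_plain, bytes_fused))
```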


Anonymous functions used to be slow, but they have had zero overhead since Julia v0.5. In fact, all of the broadcasting solutions are also using anonymous functions internally, so the presence of an anonymous function cannot be the reason test8 is slower.

As for why map! seems slow, I admit I can’t tell. I looked briefly at the code it calls in Base, and I don’t see anything obviously wrong.


I can’t reproduce this kind of huge difference. Are you remembering to interpolate the global arguments with $ into the benchmark expressions? Also, it’s more reliable to report the minimum time than the median or the mean, since the noise is all positive. @btime does the right thing:

```julia
julia> @btime test7!($D,$A,$B,$C);
  1.440 ms (0 allocations: 0 bytes)

julia> @btime test8!($D,$A,$B,$C);
  1.866 ms (1 allocation: 32 bytes)
```

Update: the above numbers were accidentally run with Julia 1.0.4. With 1.4 I get a big regression:

```julia
julia> @btime test7!($D,$A,$B,$C);
  1.455 ms (0 allocations: 0 bytes)

julia> @btime test8!($D,$A,$B,$C);
  2.926 ms (1 allocation: 32 bytes)
```

I reported an issue (https://github.com/JuliaLang/julia/issues/35914).


I was using the code from Optimizing DiffEq Code, where there is no $. All the relevant code from that page is below.

```julia
@benchmark test(A,B,C)
@benchmark test2(A,B,C)
@benchmark test3(A,B,C)
@benchmark test4(A,B,C)
@benchmark test5(A,B,C)
@benchmark test6!(D,A,B,C)
@benchmark test7!(D,A,B,C)
@benchmark test8!(D,A,B,C)
@benchmark A*B
@benchmark mul!(D,A,B)
```

If this is the wrong way to do it, I will open an issue about it on the tutorial’s GitHub. (My GitHub is a mess and I can’t make a pull request without finally learning git.) And maybe the other benchmarks are also wrong?

Thank you for the information that we should take the minimum time, and especially for this article. I must read it.

Following @Nosferican’s and @stevengj’s advice, I repeated the benchmarking, always calling with $ interpolation of the global arguments,

```julia
@benchmark fun($A,$B,$C)
```

and reporting the minimum time rather than the median. Again, I used Julia 1.3.1.

The results are only for my 4 GB RAM laptop, because due to SARS-CoV-2 I’m cut off from my stationary computer.

Below is all the code that I used, so maybe someone can find a mistake in it.

```julia
using BenchmarkTools

A = rand(1000,1000); B = rand(1000,1000); C = rand(1000,1000); D = zeros(1000,1000);

test(A,B,C) = A + B + C

test2(A,B,C) = map((a,b,c)->a+b+c,A,B,C)

function test3(A,B,C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end

test4(A,B,C) = A .+ B .+ C

test5(A,B,C) = @. A + B + C

test6!(D,A,B,C) = D .= A .+ B .+ C

test7!(D,A,B,C) = @. D = A + B + C

test8!(D,A,B,C) = map!((a,b,c)->a+b+c,D,A,B,C)

@benchmark test($A,$B,$C)
@benchmark test2($A,$B,$C)
@benchmark test3($A,$B,$C)
@benchmark test4($A,$B,$C)
@benchmark test5($A,$B,$C)
@benchmark test6!($D,$A,$B,$C)
@benchmark test7!($D,$A,$B,$C)
@benchmark test8!($D,$A,$B,$C)
```

Jupyter Notebook results.

| Function | Memory allocated | Minimum time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.463 |
| test2(...) | 7.63 MiB | 6.867 |
| test3(...) | 7.63 MiB | 4.412 |
| test4(...) | 7.63 MiB | 4.418 |
| test5(...) | 7.63 MiB | 4.469 |
| test6!(...) | 0 bytes | 4.430 |
| test7!(...) | 0 bytes | 4.425 |
| test8!(...) | 32 bytes | 10.395 |

REPL results.

| Function | Memory allocated | Minimum time (ms) |
| --- | --- | --- |
| test(...) | 7.63 MiB | 4.479 |
| test2(...) | 7.63 MiB | 6.939 |
| test3(...) | 7.63 MiB | 4.485 |
| test4(...) | 7.63 MiB | 4.497 |
| test5(...) | 7.63 MiB | 4.491 |
| test6!(...) | 0 bytes | 4.451 |
| test7!(...) | 0 bytes | 4.477 |
| test8!(...) | 32 bytes | 10.403 |