Adding matrices is a very important operation in numerical computing, and I feel that a discussion of benchmark results for it in Julia 1.x should take place here. I couldn’t find such a topic on Julia Discourse, so I am posting what I found as a starting point. I apologize if I missed an existing discussion.
A few remarks. First, I was using Julia 1.3.1. Second, I didn’t come up with this code; I was working through the "Optimizing DiffEq Code" section of DiffEqTutorials.jl and was unable to understand the results I got.
We are benchmarking functions:
test1(A, B, C) = A + B + C
test2(A, B, C) = map((a, b, c) -> a + b + c, A, B, C)
function test3(A,B,C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end
test4(A, B, C) = A .+ B .+ C
test5(A, B, C) = @. A + B + C
test6!(D, A, B, C) = D .= A .+ B .+ C
test7!(D, A, B, C) = @. D = A + B + C
test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)
For this we create the following matrices:
A = rand(1000,1000); B = rand(1000,1000); C = rand(1000,1000); D = zeros(1000,1000)
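Before timing anything, it may help to confirm that all eight variants really compute the same thing. The following is a minimal sketch of such a check, not part of the tutorial: it repeats the definitions so it runs on its own, and the names `reference` and `expected_mib` are only illustrative. It also shows where the 7.63 MiB in the tables comes from, namely one 1000 x 1000 `Float64` result matrix of 8,000,000 bytes.

```julia
# Definitions repeated from above so that the snippet is self-contained.
test1(A, B, C) = A + B + C
test2(A, B, C) = map((a, b, c) -> a + b + c, A, B, C)
function test3(A, B, C)
    D = similar(A)
    @inbounds for i in eachindex(A)
        D[i] = A[i] + B[i] + C[i]
    end
    D
end
test4(A, B, C) = A .+ B .+ C
test5(A, B, C) = @. A + B + C
test6!(D, A, B, C) = D .= A .+ B .+ C
test7!(D, A, B, C) = @. D = A + B + C
test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)

A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)

# Use the plain sum as the reference result (illustrative name).
reference = test1(A, B, C)

# Allocating variants: each returns a fresh matrix.
for f in (test2, test3, test4, test5)
    @assert f(A, B, C) ≈ reference
end

# In-place variants: each writes into a preallocated output matrix.
for f! in (test6!, test7!, test8!)
    D = zeros(1000, 1000)
    f!(D, A, B, C)
    @assert D ≈ reference
end

# Size of one result matrix: 1000 * 1000 * 8 bytes, expressed in MiB.
expected_mib = sizeof(reference) / 2^20
println(round(expected_mib, digits = 2))  # prints 7.63
```

If any of the assertions fails, the timings would not be comparing like with like.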
Results of benchmarking with @benchmark, run in a Jupyter Notebook on a desktop computer with 32 GB RAM.
Function | Memory allocated | Median time (ms) |
---|---|---|
test1(...) | 7.63 MiB | 2.399 |
test2(...) | 7.63 MiB | around 2.5 |
test3(...) | 7.63 MiB | around 2.5 |
test4(...) | 7.63 MiB | 2.421 |
test5(...) | 7.63 MiB | around 2.5 |
test6!(...) | 0 bytes | around 2.5 |
test7!(...) | 0 bytes | around 2.5 |
test8!(...) | 32 bytes | 4.029 |
Results of benchmarking with @benchmark, run in a Jupyter Notebook on a laptop with 4 GB RAM.
Function | Memory allocated | Median time (ms) |
---|---|---|
test1(...) | 7.63 MiB | 4.677 |
test2(...) | 7.63 MiB | 7.199 |
test3(...) | 7.63 MiB | 4.676 |
test4(...) | 7.63 MiB | 5.531 |
test5(...) | 7.63 MiB | 4.768 |
test6!(...) | 0 bytes | 4.682 |
test7!(...) | 0 bytes | 4.936 |
test8!(...) | 32 bytes | 11.588 |
Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on a desktop computer with 32 GB RAM.
Function | Memory allocated | Median time (ms) |
---|---|---|
test1(...) | 7.63 MiB | 2.521 |
test2(...) | 7.63 MiB | 2.529 |
test3(...) | 7.63 MiB | 2.503 |
test4(...) | 7.63 MiB | 2.517 |
test5(...) | 7.63 MiB | 2.529 |
test6!(...) | 0 bytes | 2.517 |
test7!(...) | 0 bytes | 2.529 |
test8!(...) | 32 bytes | 4.355 |
Benchmarks for adding three 1000 x 1000 Float64 matrices in the REPL on a laptop with 4 GB RAM.
Function | Memory allocated | Median time (ms) |
---|---|---|
test1(...) | 7.63 MiB | 4.878 |
test2(...) | 7.63 MiB | 7.398 |
test3(...) | 7.63 MiB | 4.484 |
test4(...) | 7.63 MiB | 4.493 |
test5(...) | 7.63 MiB | 4.505 |
test6!(...) | 0 bytes | 4.470 |
test7!(...) | 0 bytes | 4.490 |
test8!(...) | 32 bytes | 10.579 |
I can’t understand why method 1 is roughly twice as fast as method 8, since I read somewhere that method 8 should be faster because it doesn’t allocate memory. I also don’t understand why methods 6 and 7 allocate 0 bytes while method 8 allocates 32 bytes.
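To look into the allocation question separately from the timings, one option is to measure the allocated bytes of the in-place variants directly with Base's `@allocated`. The following is only a sketch under my own assumptions: `alloc_bytes` is a hypothetical helper, not something from the tutorial, and the numbers it prints may differ between Julia versions.

```julia
# In-place variants repeated from above so that the snippet is self-contained.
test6!(D, A, B, C) = D .= A .+ B .+ C
test7!(D, A, B, C) = @. D = A + B + C
test8!(D, A, B, C) = map!((a, b, c) -> a + b + c, D, A, B, C)

# Hypothetical helper: bytes allocated by a single call of f!.
# The first call is a warm-up so that compilation is not counted.
function alloc_bytes(f!, D, A, B, C)
    f!(D, A, B, C)
    return @allocated f!(D, A, B, C)
end

A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000)
D = zeros(1000, 1000)

# Report the per-call allocation of each in-place variant.
for (name, f!) in (("test6!", test6!), ("test7!", test7!), ("test8!", test8!))
    println(name, ": ", alloc_bytes(f!, D, A, B, C), " bytes")
end
```

Rerunning this with a different matrix size would show whether a small figure such as the 32 bytes is a constant per-call overhead: if it stays the same as the matrices grow, it is not a copy of the data.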