I tried the following example to test it with axpy!
NN = 400
A = rand(Float64, NN, NN)
B = rand(Float64, NN, NN);
B_cache = deepcopy(B);
B_cache2 = deepcopy(B);
function axpy2!(a,X,Y)
@. Y += a*X
Y
end
@benchmark axpy!(0.5, $A, $B_cache)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 62.600 μs … 344.000 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 70.600 μs ┊ GC (median): 0.00%
Time (mean ± σ): 73.767 μs ± 10.851 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ █▅▇▇▄▇▅▄▆▄▄▄▂▁▂▂▁▂▁ ▂ ▂
█▁█▇████████████████████▇██▇▇██▇▆▇██▇▆▇█▆▆▅▃▃█▆▃▆▄▄▃█▄▁▁▃▄▁▇ █
62.6 μs Histogram: log(frequency) by time 126 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark axpy2!(0.5, $A, $B_cache2)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 116.100 μs … 541.000 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 132.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 139.289 μs ± 29.856 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ █ █▄
▅▅▅█▇█▇██▄▃▃▂▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
116 μs Histogram: frequency by time 267 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The “home made” version is slightly slower.
Moreover, if I choose random sparse matrices, I have unexpectedly 500 times faster calcs, but I have also some allocations, whose size increase with the dimension of the matrix
NN = 400
A = sprand(Float64, NN, NN, 0.1/NN)
B = sprand(Float64, NN, NN, 0.1/NN);
B_cache = deepcopy(B);
B_cache2 = deepcopy(B);
@benchmark axpy!(0.5, $A, $B_cache)
BenchmarkTools.Trial: 2252 samples with 1 evaluation.
Range (min … max): 1.611 ms … 5.790 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.186 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.206 ms ± 375.599 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█
▁▃▂▂▂▄▃▆▅▅▅▄▅▄███▆▄▃▃▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
1.61 ms Histogram: frequency by time 3.89 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark axpy2!(0.5, $A, $B_cache2)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.788 μs … 998.751 μs ┊ GC (min … max): 0.00% … 79.00%
Time (median): 4.162 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.876 μs ± 18.765 μs ┊ GC (mean ± σ): 6.49% ± 1.69%
▁▅▇██▇▆▅▄▅▄▄▄▃▂▂▁▁▁▁▁ ▂
██████████████████████▇█▇▇██▇▇▇▇█▇▇███▇▇███▇▇▇▇███▆▆▆▇▆▆▆▅▄ █
3.79 μs Histogram: log(frequency) by time 8.12 μs <
Memory estimate: 4.75 KiB, allocs estimate: 3.