In-place axpy! but storing to a third argument rather than Y

Hello,

From the BLAS module, I know that axpy!(a, X, Y) computes Y += a*X, where a is a scalar and X and Y are matrices or vectors.
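A minimal illustration of that in-place behavior (just for context, using small made-up vectors):

using LinearAlgebra

a = 2.0
X = [1.0, 2.0]
Y = [10.0, 20.0]
axpy!(a, X, Y)   # Y is mutated in place and is now [12.0, 24.0]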

Now I instead want something like

Z = a*X + Y

i.e., something like axpyz!(a, X, Y, Z).

Is there a BLAS function or a Julia function that performs this? If not, how can I implement it with maximum efficiency?

A common pattern for out-of-place versions of otherwise in-place functions is to copy the argument you don’t want overwritten to the output, like axpy!(a,X,copy!(Z,Y)).
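A minimal sketch of that pattern (using the axpyz! name from the question; copyto! is the standard array-copy function here):

using LinearAlgebra

# Copy Y into the output Z, then accumulate a*X into Z in place.
axpyz!(a, X, Y, Z) = axpy!(a, X, copyto!(Z, Y))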

That said, I imagine that your fastest option here is simply to use broadcasting like Z .= a .* X .+ Y. Even slightly better is to allow FMA to be used (for slightly higher speed and accuracy) via Z .= muladd.(a, X, Y) (which is almost certainly what axpy! uses). These are better because they don’t have the extra step of copy!(Z,Y) like we had to use to make axpy! work.
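A minimal sketch of that broadcast version (again borrowing the axpyz! name from the question; Z must be preallocated, e.g. with similar(Y)):

# Fused broadcast with muladd: no temporaries, and FMA can be used where the hardware supports it.
function axpyz!(a, X, Y, Z)
    Z .= muladd.(a, X, Y)
    return Z
end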

For simple element-wise operations like these, BLAS is unlikely to do notably better than simple broadcast. The broadcast has the additional advantage of keeping everything in Julia, which means that additional types of elements and arrays will “just work” here. BLAS is only compiled for a few common types.
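For instance, the broadcast version also works for element types the BLAS routines are not compiled for, such as BigFloat (a small illustration, not from the original post):

Xb = rand(BigFloat, 4, 4)
Yb = rand(BigFloat, 4, 4)
Zb = similar(Yb)
Zb .= muladd.(big"0.5", Xb, Yb)   # works; BLAS itself only covers Float32/Float64 and their complex variants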

I tried the following example to compare a broadcast version with axpy!:

using LinearAlgebra, BenchmarkTools

NN = 400
A = rand(Float64, NN, NN)
B = rand(Float64, NN, NN);
B_cache = deepcopy(B);
B_cache2 = deepcopy(B);

# "homemade" in-place axpy via broadcasting
function axpy2!(a, X, Y)
    @. Y += a * X
    Y
end
@benchmark axpy!(0.5, $A, $B_cache)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  62.600 μs … 344.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     70.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   73.767 μs ±  10.851 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃ █▅▇▇▄▇▅▄▆▄▄▄▂▁▂▂▁▂▁  ▂                                   ▂
  █▁█▇████████████████████▇██▇▇██▇▆▇██▇▆▇█▆▆▅▃▃█▆▃▆▄▄▃█▄▁▁▃▄▁▇ █
  62.6 μs       Histogram: log(frequency) by time       126 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark axpy2!(0.5, $A, $B_cache2)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  116.100 μs … 541.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     132.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.289 μs ±  29.856 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▆ █ █▄                                                      
  ▅▅▅█▇█▇██▄▃▃▂▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  116 μs           Histogram: frequency by time          267 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The “homemade” version is somewhat slower (roughly 2× in the median here).

Moreover, if I instead use random sparse matrices, the broadcast version is unexpectedly about 500 times faster, but it also makes some allocations, whose size increases with the dimension of the matrix:

using SparseArrays

NN = 400
A = sprand(Float64, NN, NN, 0.1/NN)
B = sprand(Float64, NN, NN, 0.1/NN);
B_cache = deepcopy(B);
B_cache2 = deepcopy(B);
@benchmark axpy!(0.5, $A, $B_cache)

BenchmarkTools.Trial: 2252 samples with 1 evaluation.
 Range (min … max):  1.611 ms …   5.790 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.186 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.206 ms ± 375.599 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▆█                                             
  ▁▃▂▂▂▄▃▆▅▅▅▄▅▄███▆▄▃▃▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.61 ms         Histogram: frequency by time        3.89 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark axpy2!(0.5, $A, $B_cache2)

BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.788 μs … 998.751 μs  ┊ GC (min … max): 0.00% … 79.00%
 Time  (median):     4.162 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.876 μs ±  18.765 μs  ┊ GC (mean ± σ):  6.49% ±  1.69%

  ▁▅▇██▇▆▅▄▅▄▄▄▃▂▂▁▁▁▁▁                                       ▂
  ██████████████████████▇█▇▇██▇▇▇▇█▇▇███▇▇███▇▇▇▇███▆▆▆▇▆▆▆▅▄ █
  3.79 μs      Histogram: log(frequency) by time      8.12 μs <

 Memory estimate: 4.75 KiB, allocs estimate: 3.

You can use LoopVectorization to speed this up further.
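For example, a sketch along those lines (assuming dense arrays of plain number types; the function name axpy_turbo! is just illustrative):

using LoopVectorization

function axpy_turbo!(a, X, Y)
    # Explicitly SIMD-vectorized loop over linear indices.
    @turbo for i in eachindex(X)
        Y[i] = muladd(a, X[i], Y[i])
    end
    return Y
end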

In my tests with @albertomercurio's first example, I get exactly the same timings for pretty much everything I try (axpy!, a manual loop, broadcast, @turbo), which is great: it means base Julia is already optimal here. I do get better timings (in fact, a superlinear speedup!) when BLAS uses threading: check BLAS.set_num_threads. I would have thought this problem would be too small to benefit from threading, but apparently not…
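For reference, querying and changing the BLAS thread count looks like this (the thread count below is just an example value):

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(8)   # example value; e.g. match the number of physical cores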