Using axpy!

blas
benchmark

#1

The operation "multiply by a scalar and add" (y ← y + αx) can be done with `Base.LinAlg.axpy!(α, x, y)`, but I can also write `y .+= α*x`.
I was wondering: is one or the other recommended when writing library code? What does `.+=` actually do? Does it call `axpy!` under the hood?

In the microbenchmarks below, the first one allocates no memory at all, although their speeds are very close.

using Compat, BenchmarkTools, DataFrames

blas_1(α, x, y) = Base.LinAlg.axpy!(α, x, y)
broadcast_partial(α, x, y) = y .+= α*x
broadcast_full(α, x, y) = @. y += α*x

n = [10^k for k in 1:6]
x, y = [rand(ni) for ni in n], [rand(ni) for ni in n]
z = copy(y); w = copy(y)
α = sqrt(pi)

table = DataFrame(Algorithm=["blas_1", "broadcast_partial", "broadcast_full"])

for (i, ni) in enumerate(n) 
    blas_1_bench = @benchmark blas_1($α, $x[$i], $y[$i])
    broadcast_partial_bench = @benchmark broadcast_partial($α, $x[$i], $z[$i])
    broadcast_full_bench = @benchmark broadcast_full($α, $x[$i], $w[$i])
    times_ms = 1e-6 * [time(blas_1_bench),
                       time(broadcast_partial_bench),
                       time(broadcast_full_bench)]
    table[Symbol("n = $ni (ms)")] = [signif(ti, 3) for ti in times_ms]                                         
end

julia> table
3×7 DataFrames.DataFrame
│ Row │ Algorithm           │ n = 10 (ms) │ n = 100 (ms) │ n = 1000 (ms) │ n = 10000 (ms) │ n = 100000 (ms) │ n = 1000000 (ms) │
├─────┼─────────────────────┼─────────────┼──────────────┼───────────────┼────────────────┼─────────────────┼──────────────────┤
│ 1   │ "blas_1"            │ 3.0e-5      │ 0.00142      │ 0.00153       │ 0.00516        │ 0.01729         │ 0.31935          │
│ 2   │ "broadcast_partial" │ 6.0e-5      │ 0.00011      │ 0.00078       │ 0.00964        │ 0.11502         │ 2.34353          │
│ 3   │ "broadcast_full"    │ 3.0e-5      │ 4.0e-5       │ 0.00027       │ 0.00332        │ 0.05153         │ 0.75011          │

Edit 1: added the full loop-fusion version.
Edit 2: added a benchmark function (use IJulia to see a cute table). Benchmark data on a MacBook Pro (see details below).
Edit 3: round to three significant digits using signif(ti, 3).

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

#2

You want `y .+= α.*x` (note the extra dot) for the (mostly) non-allocating version; see e.g. https://julialang.org/blog/2017/01/moredots. It's also much faster than allocating the temporary `α*x`. `axpy!` is multithreaded, though, so for larger arrays you're going to see a speed difference.
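To make the point concrete, here is a minimal sketch (the values are illustrative) of what the fully-dotted update does: it fuses into a single in-place loop over `y`, with no temporary array for `α.*x`.

```julia
α = 2.0
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

# Fused, in-place: for each i, y[i] += α * x[i]; no temporary is allocated.
y .+= α .* x

# The equivalent hand-written loop:
y2 = [10.0, 20.0, 30.0]
for i in eachindex(y2)
    y2[i] += α * x[i]
end
```

After either version, `y == [12.0, 24.0, 36.0]`; by contrast, `y .+= α*x` (without the second dot) first materializes the array `α*x` and only then fuses the addition.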


#3

Got it, thanks! Indeed, the "full dots" version is even faster; I've edited the example.


#4

Interesting: for me `axpy!` is 20% faster on 10k elements. For smaller arrays it's even worse for broadcasting: `axpy!` is 5x faster for arrays of size 100. For larger arrays the difference levels off, until multithreading kicks in and `axpy!` gets faster again. I guess broadcasting still has some way to go?


#5

You need `$` in front of each variable in the call to the `@btime` macro.


#6

Like `@btime f($α, $x, $y)`? Does what I wrote benchmark something else?


#7

I see, thanks! @mforets, see the README at https://github.com/JuliaCI/BenchmarkTools.jl. With that, the broadcast is slightly faster. For n=100_000 `axpy!` is faster because of multithreading; for n=1_000_000 it still burns all my cores, but takes exactly the same time as broadcasting. I officially give up on understanding multithreading performance.
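For reference, a sketch of the interpolation pattern the README describes (the function name `axpy_bcast` is made up for the example): `$` splices the current *value* of a variable into the benchmarked expression, so global-variable lookup and setup cost are excluded from the measurement.

```julia
using BenchmarkTools

# Illustrative function under test: the fully-dotted in-place update.
axpy_bcast(α, x, y) = (y .+= α .* x; y)

α = sqrt(pi)
x, y = rand(1000), rand(1000)

# Interpolate values with `$`; variables defined in `setup` are used bare.
# `evals=1` gives each evaluation a fresh copy of y, since the call mutates it.
b = @benchmark axpy_bcast($α, $x, yc) setup=(yc = copy($y)) evals=1
```

Without the `$`, `@benchmark axpy_bcast(α, x, y)` times the expression with `α`, `x`, `y` as non-constant globals, which adds dispatch and lookup overhead to every sample.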


#8

BLAS level 1 is mostly memory-bound and doesn't benefit that much from multithreading.

@mforets, I don’t see the edited/updated code and benchmarks.


#9

I think that statement is architecture-dependent: I've been on machines where the memory bandwidth scaled with the number of threads, and machines where it didn't. But yes, that's probably what's happening here: for n=100_000 the arrays are still in cache and multithreading is efficient, while for large sizes it's memory-bandwidth-bound anyway, so it doesn't speed up with multithreading. Thanks for the explanation!


#10

There you go :v:


#11

Just out of curiosity, can you increase n and see if you reproduce what I see, that the benefits of multithreading disappear when you get out of the cache?


#12

Hi Antoine, I could go up to 10^8, but beyond that I run out of swap memory. Here is a plot in log scale where we can see one crossover:


#13

In real code (not just benchmarks), be sure you look at `axpy!` in the context of any other operations you are doing on the arrays. If you are doing other O(n) operations too, then you'll probably want to fuse them into a single loop together with the axpy anyway. If you are also doing Ω(n²) or similarly expensive operations, then aggressively optimizing the O(n) (BLAS-1) stuff is probably not worth it.
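As a sketch of that fusion point (the coefficients and the extra `sin` term are made up for illustration): several O(n) updates written as one `@.` expression compile to a single pass over the data, instead of three separate BLAS-1-style sweeps.

```julia
α, β = 2.0, 3.0
x = [1.0, 2.0]
z = [1.0, 1.0]
y = [0.0, 0.0]

# One fused loop: y[i] = α*x[i] + β*z[i] + sin(x[i]) for each i,
# rather than an axpy!, another axpy!, and a separate map over x.
@. y = α * x + β * z + sin(x)
```

Each unfused step would read and write the full arrays once; fusing keeps every element in a register for the whole update, which matters precisely because BLAS-1 work is memory-bound.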