Using axpy!

blas
benchmark

#1

The operation "multiply by a scalar and add" (y ← y + αx) can be done with `Base.LinAlg.axpy!(α, x, y)`, but I can also write `y .+= α*x`.
I was wondering: is one or the other recommended when writing library code? What does `.+=` actually do? Does it call `axpy!` under the hood?

In the microbenchmarks below, the first one allocates no memory at all, although their speeds are very close.

using Compat, BenchmarkTools, DataFrames

blas_1(α, x, y) = Base.LinAlg.axpy!(α, x, y)
broadcast_partial(α, x, y) = y .+= α*x
broadcast_full(α, x, y) = @. y += α*x

n = [10^k for k in 1:6]
x, y = [rand(ni) for ni in n], [rand(ni) for ni in n]
z = copy(y); w = copy(y)
α = sqrt(pi)

table = DataFrame(Algorithm=["blas_1", "broadcast_partial", "broadcast_full"])

for (i, ni) in enumerate(n) 
    blas_1_bench = @benchmark blas_1($α, $x[$i], $y[$i])
    broadcast_partial_bench = @benchmark broadcast_partial($α, $x[$i], $z[$i])
    broadcast_full_bench = @benchmark broadcast_full($α, $x[$i], $w[$i])
    times_ms = 1e-6 * [time(blas_1_bench),
                       time(broadcast_partial_bench),
                       time(broadcast_full_bench)]
    table[Symbol("n = $ni (ms)")] = [signif(ti, 3) for ti in times_ms]                                         
end

julia> table
3×7 DataFrames.DataFrame
│ Row │ Algorithm           │ n = 10 (ms) │ n = 100 (ms) │ n = 1000 (ms) │ n = 10000 (ms) │ n = 100000 (ms) │ n = 1000000 (ms) │
├─────┼─────────────────────┼─────────────┼──────────────┼───────────────┼────────────────┼─────────────────┼──────────────────┤
│ 1   │ "blas_1"            │ 3.0e-5      │ 0.00142      │ 0.00153       │ 0.00516        │ 0.01729         │ 0.31935          │
│ 2   │ "broadcast_partial" │ 6.0e-5      │ 0.00011      │ 0.00078       │ 0.00964        │ 0.11502         │ 2.34353          │
│ 3   │ "broadcast_full"    │ 3.0e-5      │ 4.0e-5       │ 0.00027       │ 0.00332        │ 0.05153         │ 0.75011          │

Edit 1: added the full loop-fusion version.
Edit 2: added a benchmark function (use IJulia to see a cute table). Benchmark data on a MacBook Pro (see details below).
Edit 3: round to three significant digits using signif(ti, 3).

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

#2

You want `y .+= α.*x` (note the extra dot) for the (mostly) non-allocating version; see e.g. https://julialang.org/blog/2017/01/moredots. It's also much faster than allocating the temporary `α*x`. `axpy!` is multithreaded, though, so for larger arrays you're going to see a speed difference.
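To make the point concrete, here is a minimal sketch (the values are illustrative) of what the fully-dotted update does: it fuses into a single in-place loop over `y`, with no temporary array for `α.*x`.

```julia
α = 2.0
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

# Fused, in-place: for each i, y[i] += α * x[i]; no temporary is allocated.
y .+= α .* x

# The equivalent hand-written loop:
y2 = [10.0, 20.0, 30.0]
for i in eachindex(y2)
    y2[i] += α * x[i]
end
```

After either version, `y == [12.0, 24.0, 36.0]`; by contrast, `y .+= α*x` (without the second dot) first materializes the array `α*x` and only then fuses the addition.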


#3

Got it, thanks! Indeed, the "full dots" version is even faster; I've edited the example.


#4

Interesting: for me `axpy!` is 20% faster on 10k elements. For smaller arrays it's even worse for broadcasting: `axpy!` is 5x faster for arrays of size 100. For larger arrays the difference levels off, until multithreading kicks in and `axpy!` gets faster again. I guess broadcasting still has some way to go?


#5

You need `$` in front of each variable in the call to the `@btime` macro.


#6

Like `@btime f($α, $x, $y)`? Does what I wrote benchmark something else?


#7

I see, thanks! @mforets, see the README at https://github.com/JuliaCI/BenchmarkTools.jl. With that, the broadcast is slightly faster. For n=100_000 `axpy!` is faster because of multithreading; for n=1_000_000 it still burns all my cores, but takes exactly the same time as broadcasting. I officially give up on understanding multithreading performance.
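For reference, a sketch of the interpolation pattern the README describes (the function name `axpy_bcast` is made up for the example): `$` splices the current *value* of a variable into the benchmarked expression, so global-variable lookup and setup cost are excluded from the measurement.

```julia
using BenchmarkTools

# Illustrative function under test: the fully-dotted in-place update.
axpy_bcast(α, x, y) = (y .+= α .* x; y)

α = sqrt(pi)
x, y = rand(1000), rand(1000)

# Interpolate values with `$`; variables defined in `setup` are used bare.
# `evals=1` gives each evaluation a fresh copy of y, since the call mutates it.
b = @benchmark axpy_bcast($α, $x, yc) setup=(yc = copy($y)) evals=1
```

Without the `$`, `@benchmark axpy_bcast(α, x, y)` times the expression with `α`, `x`, `y` as non-constant globals, which adds dispatch and lookup overhead to every sample.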


#8

BLAS level 1 is mostly memory-bound and doesn't benefit that much from multithreading.

@mforets, I don’t see the edited/updated code and benchmarks.


#9

I think that statement is architecture-dependent: I've been on machines where the memory bandwidth scaled with the number of threads, and machines where it didn't. But yes, that's probably what's happening here: for n=100_000 the arrays are still in cache and multithreading is efficient, while for large sizes it's memory-bandwidth-bound anyway, so it doesn't speed up with multithreading. Thanks for the explanation!


#10

There you go :v:


#11

Just out of curiosity, can you increase n and see if you reproduce what I see, that the benefits of multithreading disappear when you get out of the cache?


#12

Hi Antoine, I could go up to 10^8, but beyond that I run out of swap memory. Here is a plot in log scale where we can see one crossover:


#13

In real code (not just benchmarks), be sure you look at `axpy!` in the context of any other operations you are doing on the arrays. If you are doing other O(n) operations too, then you'll probably want to fuse them into a single loop together with the axpy anyway. If you are also doing Ω(n²) or similarly expensive operations, then aggressively optimizing the O(n) (BLAS-1) stuff is probably not worth it.
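As a sketch of that fusion point (the coefficients and the extra `sin` term are made up for illustration): several O(n) updates written as one `@.` expression compile to a single pass over the data, instead of three separate BLAS-1-style sweeps.

```julia
α, β = 2.0, 3.0
x = [1.0, 2.0]
z = [1.0, 1.0]
y = [0.0, 0.0]

# One fused loop: y[i] = α*x[i] + β*z[i] + sin(x[i]) for each i,
# rather than an axpy!, another axpy!, and a separate map over x.
@. y = α * x + β * z + sin(x)
```

Each unfused step would read and write the full arrays once; fusing keeps every element in a register for the whole update, which matters precisely because BLAS-1 work is memory-bound.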