The operation of multiplication by a scalar and translation can be done with Base.LinAlg.axpy!(α, x, y)
. But I can also use y .+= α*x
.
I was wondering: is it recommended to use one or the other for writing library code? What does .+=
actually do? Does it call axpy!
under the hood?
In microbenchmarks (see below), the first one allocates no memory at all, although their speed is “very very close”.
using Compat, BenchmarkTools, DataFrames
blas_1(α, x, y) = Base.LinAlg.axpy!(α, x, y)
broadcast_partial(α, x, y) = y .+= α*x
broadcast_full(α, x, y) = @. y += α*x
n = [10^k for k in 1:6]
x, y = [rand(ni) for ni in n], [rand(ni) for ni in n]
z = copy(y); w = copy(y)
α = sqrt(pi)
table = DataFrame(Algorithm=["blas_1", "broadcast_partial", "broadcast_full"])
for (i, ni) in enumerate(n)
blas_1_bench = @benchmark blas_1($α, $x[$i], $y[$i])
broadcast_partial_bench = @benchmark broadcast_partial($α, $x[$i], $z[$i])
broadcast_full_bench = @benchmark broadcast_full($α, $x[$i], $w[$i])
times_ms = 1e-6 * [time(blas_1_bench),
time(broadcast_partial_bench),
time(broadcast_full_bench)]
table[Symbol("n = $ni (ms)")] = [signif(ti, 3) for ti in times_ms]
end
julia> table
3×7 DataFrames.DataFrame
│ Row │ Algorithm │ n = 10 (ms) │ n = 100 (ms) │ n = 1000 (ms) │ n = 10000 (ms) │ n = 100000 (ms) │ n = 1000000 (ms) │
├─────┼─────────────────────┼─────────────┼──────────────┼───────────────┼────────────────┼─────────────────┼──────────────────┤
│ 1 │ "blas_1" │ 3.0e-5 │ 0.00142 │ 0.00153 │ 0.00516 │ 0.01729 │ 0.31935 │
│ 2 │ "broadcast_partial" │ 6.0e-5 │ 0.00011 │ 0.00078 │ 0.00964 │ 0.11502 │ 2.34353 │
│ 3 │ "broadcast_full" │ 3.0e-5 │ 4.0e-5 │ 0.00027 │ 0.00332 │ 0.05153 │ 0.75011 │
Edit 1: added the full loop-fusion version.
Edit 2: added a benchmark function (use IJulia to see a cute table). Benchmark data on a MacBook Pro (see details below).
Edit 3: round to three significative digits using signif(ti, 3)
.
julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)