Why is broadcast faster than the dot syntax? (Performance differences between @., ., broadcast and broadcast!)

From the manual, I expected both cases below to perform equally well, but that seems not to be the case:

julia> using BenchmarkTools

julia> a = [2, 3, 4, 5];

julia> b = [6 7 8 9];

julia> @benchmark a .+ b
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  3
  --------------
  minimum time:     331.071 ns (0.00% GC)
  median time:      335.829 ns (0.00% GC)
  mean time:        362.109 ns (1.72% GC)
  maximum time:     6.075 μs (91.57% GC)
  --------------
  samples:          10000
  evals/sample:     225

julia> @benchmark broadcast(+, a, b)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     89.245 ns (0.00% GC)
  median time:      91.958 ns (0.00% GC)
  mean time:        100.715 ns (1.96% GC)
  maximum time:     757.825 ns (74.58% GC)
  --------------
  samples:          10000
  evals/sample:     958

The dot syntax is over 3x slower. Using broadcast! to write into a preallocated array gives a further speed-up of almost 2x:

julia> c = similar(a .+ b);

julia> @benchmark broadcast!(+, c, a, b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     49.568 ns (0.00% GC)
  median time:      50.333 ns (0.00% GC)
  mean time:        51.231 ns (0.00% GC)
  maximum time:     140.148 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

So all three examples seem to be doing different things. To complicate matters, the following do not seem to be equivalent to broadcast!:

julia> @benchmark c .= a .+ b
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     280.375 ns (0.00% GC)
  median time:      284.998 ns (0.00% GC)
  mean time:        298.332 ns (0.45% GC)
  maximum time:     5.007 μs (90.98% GC)
  --------------
  samples:          10000
  evals/sample:     285

julia> @benchmark @. c = a + b
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     281.868 ns (0.00% GC)
  median time:      297.573 ns (0.00% GC)
  mean time:        341.300 ns (0.42% GC)
  maximum time:     5.637 μs (91.15% GC)
  --------------
  samples:          10000
  evals/sample:     287

What’s going on here? What am I missing? Are those results valid in general? Should I always use broadcast!?

Why does the manual say that “[d]otted operators such as .+ and .* are equivalent to broadcast calls (except that they fuse, as described above)”? (source)
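For reference, the equivalence the manual describes can be checked on the results alone, independent of timing. The following is only a minimal sketch using the same a and b as above; the names r_dot, r_bc, c1, c2, r_fused and r_manual are made up for illustration, and the anonymous function is my own hand-written stand-in for what fusion is assumed to produce:

```julia
a = [2, 3, 4, 5]      # 4-element Vector{Int}
b = [6 7 8 9]         # 1×4 Matrix{Int}

# Allocating forms: both build a new 4×4 matrix.
r_dot = a .+ b
r_bc  = broadcast(+, a, b)

# In-place forms: both write into a preallocated destination.
c1 = similar(r_dot)
c2 = similar(r_dot)
c1 .= a .+ b
broadcast!(+, c2, a, b)

# Fusion: a chain of dotted calls behaves like one broadcast over a
# single combined function (written out by hand here as an assumption).
r_fused  = a .+ 2 .* b
r_manual = broadcast((x, y) -> x + 2 * y, a, b)

println(size(r_dot))
println(r_dot == r_bc == c1 == c2)
println(r_fused == r_manual)
```

This only shows that the forms compute the same values; it says nothing about why the timings above differ.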

Does that have something to do with my environment?

julia> versioninfo()
Julia Version 1.4.1
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, broadwell)
Environment:
  JULIA_NUM_THREADS = 4

I think a lot of this is due to benchmarking in the global scope. Does this change much if you define these in a function?
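A rough way to see why global scope matters, sketched under the assumption that a and b are ordinary non-constant globals as in the session above: code that reads them directly cannot be specialized on their types, while a function that receives them as arguments can. The helper names add_globals and add_args are hypothetical, and Base.return_types is an unexported reflection helper used here only for illustration:

```julia
a = [2, 3, 4, 5]   # non-constant global, as in the original session
b = [6 7 8 9]      # non-constant global

# Hypothetical helper that reads the globals directly.
add_globals() = a .+ b

# Hypothetical helper that receives the arrays as arguments.
add_args(x, y) = x .+ y

# Ask the compiler what return type it can infer in each case.
rt_globals = Base.return_types(add_globals, ())[1]
rt_args    = Base.return_types(add_args, (Vector{Int}, Matrix{Int}))[1]

println("reading globals:  ", rt_globals)
println("taking arguments: ", rt_args)
```

When the inferred type is not concrete, each operation on the globals has to be dispatched at run time, which is a plausible source of the extra time and allocations in the first set of benchmarks.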


Wow, what a difference! You’re right; I wasn’t aware of that, thanks!

a .+ b and broadcast(+, a, b) indeed allocate only once, while all others have zero allocations.

julia> using BenchmarkTools

julia> a = [2, 3, 4, 5];

julia> b = [6 7 8 9];

julia> function add_dot(a, b)
           a .+ b
           return nothing
       end
add_dot (generic function with 1 method)

julia> function add_broadcast(a, b)
           broadcast(+, a, b)
           return nothing
       end
add_broadcast (generic function with 1 method)

julia> function add_broadcast!(c, a, b)
           broadcast!(+, c, a, b)
           return nothing
       end
add_broadcast! (generic function with 1 method)

julia> function add_dotdot(c, a, b)
           c .= a .+ b
           return nothing
       end
add_dotdot (generic function with 1 method)

julia> function add_atdot(c, a, b)
           @. c = a + b
           return nothing
       end
add_atdot (generic function with 1 method)

julia> @benchmark add_dot(a, b)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     87.726 ns (0.00% GC)
  median time:      90.257 ns (0.00% GC)
  mean time:        96.742 ns (1.80% GC)
  maximum time:     771.672 ns (68.02% GC)
  --------------
  samples:          10000
  evals/sample:     961

julia> @benchmark add_broadcast(a, b)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     92.308 ns (0.00% GC)
  median time:      95.286 ns (0.00% GC)
  mean time:        100.602 ns (1.83% GC)
  maximum time:     660.674 ns (75.04% GC)
  --------------
  samples:          10000
  evals/sample:     954

julia> c = similar(a .+ b);

julia> @benchmark add_broadcast!(c, a, b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     49.198 ns (0.00% GC)
  median time:      50.092 ns (0.00% GC)
  mean time:        50.921 ns (0.00% GC)
  maximum time:     146.801 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark add_dotdot(c, a, b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.524 ns (0.00% GC)
  median time:      47.884 ns (0.00% GC)
  mean time:        48.874 ns (0.00% GC)
  maximum time:     160.585 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark add_atdot(c, a, b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     50.838 ns (0.00% GC)
  median time:      51.519 ns (0.00% GC)
  mean time:        55.228 ns (0.00% GC)
  maximum time:     146.375 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

You want to do @benchmark add_dot($a, $b) (or @btime add_dot($a, $b)), etc., so that you don’t pay the price of dynamic dispatch on the types of the globals a and b.
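As a minimal sketch of that advice (assuming BenchmarkTools is installed; add_dot is redefined here to return its result, which differs from the version earlier in the thread), the interpolated and non-interpolated calls can be compared side by side:

```julia
using BenchmarkTools

a = [2, 3, 4, 5]
b = [6 7 8 9]
c = similar(a .+ b)       # preallocated 4×4 destination

add_dot(x, y) = x .+ y    # illustrative helper, returns the result

# Without interpolation: a and b are looked up as untyped globals.
@btime add_dot(a, b);

# With interpolation: the values are spliced into the benchmark
# expression, so the call is measured as if made from typed code.
@btime add_dot($a, $b);

# The same applies to bare dotted expressions, including in-place ones.
@btime $a .+ $b;
@btime $c .= $a .+ $b;
```

The timings will vary by machine and Julia version, so none are shown here.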


Nice. Maybe a silly question, but does @benchmark $a .+ $b eliminate the issue of benchmarking in the global scope?

julia> @benchmark $a .+ $b
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     72.072 ns (0.00% GC)
  median time:      76.215 ns (0.00% GC)
  mean time:        85.396 ns (2.41% GC)
  maximum time:     771.627 ns (74.82% GC)
  --------------
  samples:          10000
  evals/sample:     974

julia> @benchmark add_dot($a, $b)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     76.696 ns (0.00% GC)
  median time:      80.586 ns (0.00% GC)
  mean time:        97.871 ns (2.48% GC)
  maximum time:     2.102 μs (88.07% GC)
  --------------
  samples:          10000
  evals/sample:     971

julia> @benchmark a .+ b
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  3
  --------------
  minimum time:     386.757 ns (0.00% GC)
  median time:      394.337 ns (0.00% GC)
  mean time:        421.376 ns (1.54% GC)
  maximum time:     7.220 μs (91.95% GC)
  --------------
  samples:          10000
  evals/sample:     202

Yes.
