Why is a multi-argument in-place map much faster in this case than a broadcast?

I can confirm that this is system dependent.

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  6.505 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  6.724 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = subl

On another system:

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  8.947 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  12.092 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores
Environment:
  JULIA_EDITOR = vi

On yet another system:

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  10.730 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  12.526 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 28 × Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 28 virtual cores

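I wish this weren’t the case, as it makes it difficult to write performant code. In general, though, map! does appear to be faster for wide matrices: it won on all three systems above, with the broadcast taking roughly 3% to 35% longer depending on the machine.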
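To check how much the shape matters on a given machine, something like the following sketch could be used. It assumes BenchmarkTools is installed (the same package that provides @btime above); the helper name compare_map_broadcast and the list of shapes are made up for illustration and are not from the original thread.

```julia
using BenchmarkTools  # provides @belapsed (minimum elapsed time in seconds)

# Hypothetical helper: time map! and the fused broadcast for one m × n shape.
function compare_map_broadcast(m, n)
    A = rand(m, n); B = copy(A)
    C = zero(A); D = zero(A)
    t_map = @belapsed map!(+, $C, $A, $B)
    t_bc  = @belapsed $D .= $A .+ $B
    @assert C == D  # both approaches must compute the same result
    return (map = t_map, broadcast = t_bc, ratio = t_bc / t_map)
end

# Wide, square, and tall shapes, all with 10_000 elements (illustrative choice).
for (m, n) in ((10, 1000), (100, 100), (1000, 10))
    r = compare_map_broadcast(m, n)
    println("$(m)×$(n): map! = $(round(r.map * 1e6, digits = 3)) μs, ",
            "broadcast = $(round(r.broadcast * 1e6, digits = 3)) μs, ",
            "broadcast/map! = $(round(r.ratio, digits = 2))")
end
```

A ratio above 1 means map! was faster for that shape. As the timings above show, the numbers will differ from system to system, so this is only a way to measure locally, not a prediction of what you will see.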
