I can confirm that this is system dependent.
```
julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  6.505 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  6.724 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = subl
```
On another system:
```
julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  8.947 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  12.092 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores
Environment:
  JULIA_EDITOR = vi
```
On yet another system:
```
julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  10.730 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  12.526 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 28 × Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 28 virtual cores
```
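I wish this weren't the case, as it makes it difficult to write code that performs well across machines. In general, though, map! does appear to be faster for wide matrices: for this 10 × 1000 case it wins on all three machines, with broadcasting taking about 3% longer on the i7, 35% longer on the EPYC, and 17% longer on the Xeon.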
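If you want to check the wide-versus-tall observation on your own machine, here is a minimal Base-only sketch. The helper names (`min_time_ns`, `map_add!`, `bcast_add!`, `compare`) are hypothetical, the extra shapes (1000 × 10 and 100 × 100) are my own choice for comparison, and the `time_ns` loop is only a crude stand-in for BenchmarkTools' `@btime` used above, so expect more noise.

```julia
# Crude timing helper (hypothetical): minimum wall time over `reps` calls of `f`.
function min_time_ns(f, reps)
    best = typemax(UInt64)
    for _ in 1:reps
        t0 = time_ns()
        f()
        best = min(best, time_ns() - t0)
    end
    return best
end

# The two in-place forms being compared in this thread.
map_add!(C, A, B) = map!(+, C, A, B)
bcast_add!(D, A, B) = (D .= A .+ B)

# Time both forms on an m × n matrix, after a warm-up call that triggers compilation.
function compare(m, n; reps = 1_000)
    A = rand(m, n); B = copy(A); C = zero(A); D = zero(A)
    map_add!(C, A, B); bcast_add!(D, A, B)   # warm-up
    @assert C == D                           # both forms compute the same elementwise sum
    tmap = min_time_ns(() -> map_add!(C, A, B), reps)
    tbc  = min_time_ns(() -> bcast_add!(D, A, B), reps)
    return (shape = (m, n), map_ns = tmap, broadcast_ns = tbc)
end

# Same number of elements (10_000) in a wide, a tall, and a square layout.
for (m, n) in ((10, 1000), (1000, 10), (100, 100))
    r = compare(m, n)
    println(r.shape, ": map! ", r.map_ns, " ns, broadcast ", r.broadcast_ns, " ns")
end
```

Because `time_ns` measurements include timer overhead and OS jitter, small gaps like the ~3% seen on the i7 may not be distinguishable with this sketch; BenchmarkTools is the better tool for confirming them.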