Why is a multi-argument in-place map much faster in this case than a broadcast?

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  3.405 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  10.915 μs (0 allocations: 0 bytes)


julia> A = rand(1000, 10); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  3.041 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  3.392 μs (0 allocations: 0 bytes)

This difference goes away if A is nearly square, but is considerable if the sizes are very different.

3 Likes

For me, in the second case

A = rand(1000, 10)

the broadcast is actually significantly faster than map!. My env is:

Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

And for the square case, the broadcast version is much faster.

Oh interesting, so this may be LLVM generating subpar code on my laptop, which I’ve experienced before.

I can reproduce:

$ ./julia -O3 --min-optlevel=3
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-DEV.122 (2022-12-11)
 _/ |\__'_|_|_|\__'_|  |  Commit 704e1173879 (0 days old master)
|__/                   |

julia> using BenchmarkTools

julia> f(a, b) = (rand(a, b), rand(a, b), rand(a, b))
f (generic function with 1 method)

julia> @benchmark (c .= a .+ b) setup=((a,b,c) = f(10, 1000))
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.598 μs …  13.447 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.717 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.793 μs ± 446.390 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▆▆▃▂▁                                                     ▁
  ▇▆▇▅▅▁▃▁▃▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▃▁▃ 
  6.6 μs       Histogram: log(frequency) by time      10.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark (c .= a .+ b) setup=((a,b,c) = f(1000, 10))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.961 μs …   7.451 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.114 μs ± 204.507 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▅▆▆▆▅▃▃                                               
  ▂▆▇▆▆▆▅▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂ 
  1.96 μs         Histogram: frequency by time        2.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(+, c, a, b) setup=((a,b,c) = f(10, 1000))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.955 μs …   8.022 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.047 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.080 μs ± 213.891 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅▇▆▆▆▃▁                                                 
  ▃▆▆▅▅▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▂▁▂▂ 
  1.95 μs         Histogram: frequency by time        2.68 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(+, c, a, b) setup=((a,b,c) = f(1000, 10))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.954 μs …   7.545 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.082 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.109 μs ± 207.253 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▅▅▆▆▇▇▆▅▅▂                                            
  ▅▅▇▇▆▅▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂ 
  1.95 μs         Histogram: frequency by time        2.61 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Except that I’d say that the observation only holds for wide matrices, but not for tall matrices.

I can confirm that this is system dependent.

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  6.505 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  6.724 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = subl

On another system:

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  8.947 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  12.092 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores
Environment:
  JULIA_EDITOR = vi

On yet another

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  10.730 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  12.526 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 28 × Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 28 virtual cores

I wish this weren’t the case, as it makes it difficult to write performant code. In general, though, map! does appear to be faster for wide matrices.

1 Like

I’ve created an issue to draw more eyeballs to this.

Interesting.

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  8.467 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  10.500 μs (0 allocations: 0 bytes)

julia> A = rand(1000, 10); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  9.100 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  3.600 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores
1 Like

map! can use linear indexing; broadcasting cannot, for semantic reasons, but it could in theory check for a fast path, like FastBroadcast.jl does.

Having only 10 rows prevents vectorization when not using linear indexing.
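To make that concrete, here is a minimal hand-written sketch (my own illustration, not the code Base actually runs) contrasting the two access patterns: a single linear loop over eachindex, which LLVM can vectorize across the whole array, versus a nested Cartesian loop whose inner trip count is only size(A, 1) = 10, usually too short to vectorize profitably.

function add_linear!(C, A, B)
    @inbounds for i in eachindex(C, A, B)   # one long loop over all elements
        C[i] = A[i] + B[i]
    end
    return C
end

function add_cartesian!(C, A, B)
    m, n = size(A)
    @inbounds for j in 1:n, i in 1:m        # inner loop runs only m (= 10) iterations
        C[i, j] = A[i, j] + B[i, j]
    end
    return C
end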

2 Likes

Interesting, so Cartesian indexing is faster in the tall case?

Why not?

FWIW, I get

julia> using FastBroadcast

julia> A = rand(10, 1000); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  1.764 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  5.865 μs (0 allocations: 0 bytes)

julia> @btime @.. $D = $A + $B;
  1.767 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.10.0-DEV.75
Commit 70bda2cfe4* (2022-11-30 01:44 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

What if size(A,1) == 1, or size(B,1) == 1, or both?
Broadcasting has to handle all of these cases; map! does not.
FastBroadcast is faster than broadcast because it first checks whether any dynamic broadcasting is actually needed. If not, it calls the fast code; otherwise it either throws or (with @.. broadcast=true) falls back to slower code that can handle the dynamic broadcasting.
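For intuition, here is a rough sketch of that kind of check (my own illustration, not FastBroadcast.jl’s actual implementation): if no argument needs a length-1 dimension stretched, fall through to a plain linear loop, otherwise take the general broadcast path.

function add_fastpath!(dest, a, b)
    if axes(dest) == axes(a) == axes(b)      # no dynamic broadcasting needed
        @inbounds @simd for i in eachindex(dest, a, b)
            dest[i] = a[i] + b[i]
        end
        return dest
    end
    return dest .= a .+ b                    # general (slower) broadcast fallback
end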

LLVM is pretty good about creating multiple versions of loops to make different cases fast, but with Cartesian indexing we don’t get vectorization unless the number of rows is large.

Cartesian indexing shouldn’t ever be faster than linear indexing here, but if it is mostly vectorized, the gap should be smaller.

Or if the arrays are large enough to be bottlenecked by memory bandwidth, then we won’t get a performance gap either because the CPU will just be sitting idle most of the time.
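(For scale: each 10×1000 Float64 array here is 10·1000·8 B ≈ 80 KB, so all three together are about 240 KB and fit comfortably in cache on these CPUs; these particular benchmarks shouldn’t be bandwidth-bound.)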

julia> A = rand(1000, 10); B = copy(A); C = zero(A); D = zero(A);

julia> @btime map!(+, $C, $A, $B);
  1.823 μs (0 allocations: 0 bytes)

julia> @btime $D .= $A .+ $B;
  1.836 μs (0 allocations: 0 bytes)

julia> @btime @.. $D = $A + $B;
  1.792 μs (0 allocations: 0 bytes)

@uniment, can you try starting Julia with -Cskylake and rerunning your benchmarks?

All your results are bad (except, I guess, for tall broadcasting), but the fact that the tall case is somehow faster is hard to explain. It should look like what I reported here.

4 Likes

Very interesting. It looks like 1.9.0-alpha1 is also substantially more performant for this.

1.8.3 with -Cskylake:

julia> using BenchmarkTools, FastBroadcast

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1),
           A2 = rand(1000, 10), B2 = copy(A2), C2 = zero(A2), D2 = zero(A2), E2 = zero(A2)

           @btime map!(+, $C1, $A1, $B1)
           @btime $D1 .= $A1 .+ $B1
           @btime @.. $E1 = $A1 + $B1
           @btime map!(+, $C2, $A2, $B2)
           @btime $D2 .= $A2 .+ $B2
           @btime @.. $E2 = $A2 + $B2

           C1 ≈ D1 ≈ E1 && C2 ≈ D2 ≈ E2
       end
  5.150 μs (0 allocations: 0 bytes)
  4.343 μs (0 allocations: 0 bytes)
  1.700 μs (0 allocations: 0 bytes)
  5.150 μs (0 allocations: 0 bytes)
  1.710 μs (0 allocations: 0 bytes)
  1.690 μs (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores

1.8.3 without -Cskylake:

julia> using BenchmarkTools, FastBroadcast

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1),
           A2 = rand(1000, 10), B2 = copy(A2), C2 = zero(A2), D2 = zero(A2), E2 = zero(A2)

           @btime map!(+, $C1, $A1, $B1)
           @btime $D1 .= $A1 .+ $B1
           @btime @.. $E1 = $A1 + $B1
           @btime map!(+, $C2, $A2, $B2)
           @btime $D2 .= $A2 .+ $B2
           @btime @.. $E2 = $A2 + $B2

           C1 ≈ D1 ≈ E1 && C2 ≈ D2 ≈ E2
       end
  9.400 μs (0 allocations: 0 bytes)
  14.900 μs (0 allocations: 0 bytes)
  3.443 μs (0 allocations: 0 bytes)
  8.800 μs (0 allocations: 0 bytes)
  5.000 μs (0 allocations: 0 bytes)
  5.250 μs (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores

1.8.0:

julia> using BenchmarkTools, FastBroadcast

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1),
           A2 = rand(1000, 10), B2 = copy(A2), C2 = zero(A2), D2 = zero(A2), E2 = zero(A2)

           @btime map!(+, $C1, $A1, $B1)
           @btime $D1 .= $A1 .+ $B1
           @btime @.. $E1 = $A1 + $B1
           @btime map!(+, $C2, $A2, $B2)
           @btime $D2 .= $A2 .+ $B2
           @btime @.. $E2 = $A2 + $B2

           C1 ≈ D1 ≈ E1 && C2 ≈ D2 ≈ E2
       end
  5.150 μs (0 allocations: 0 bytes)
  4.229 μs (0 allocations: 0 bytes)
  1.680 μs (0 allocations: 0 bytes)
  5.150 μs (0 allocations: 0 bytes)
  1.730 μs (0 allocations: 0 bytes)
  1.700 μs (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores

1.9.0-alpha1:

julia> using BenchmarkTools, FastBroadcast

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1),
           A2 = rand(1000, 10), B2 = copy(A2), C2 = zero(A2), D2 = zero(A2), E2 = zero(A2)

           @btime map!(+, $C1, $A1, $B1)
           @btime $D1 .= $A1 .+ $B1
           @btime @.. $E1 = $A1 + $B1
           @btime map!(+, $C2, $A2, $B2)
           @btime $D2 .= $A2 .+ $B2
           @btime @.. $E2 = $A2 + $B2

           C1 ≈ D1 ≈ E1 && C2 ≈ D2 ≈ E2
       end
  1.710 μs (0 allocations: 0 bytes)
  4.343 μs (0 allocations: 0 bytes)
  1.700 μs (0 allocations: 0 bytes)
  1.700 μs (0 allocations: 0 bytes)
  1.730 μs (0 allocations: 0 bytes)
  1.700 μs (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores

It appears the march of progress is non-monotonic. 1.8.0 and 1.9.0-alpha1 don’t seem to behave any differently with or without the -Cskylake option.

2 Likes

Note that your CPU identification went from “goldmont” in 1.8 to “alderlake” in 1.9.0-alpha1.
Seeing the “goldmont” is why I suggested trying “skylake”.

I’m curious how the assembly (@code_native syntax=:intel debuginfo=:none map!(+, C1, A1, B1)) differs between 1.8.0 and 1.8.3 w/ and w/out -Cskylake.
1.8.0 also said “goldmont”…

Interesting that map! improved for 1.9.

1 Like

1.8.3 with -Cskylake:

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1)
           @code_native syntax=:intel debuginfo=:none map!(+, C1, A1, B1)
       end
        .text
        .file   "map!"
        .globl  "japi1_map!_1084"               # -- Begin function japi1_map!_1084
        .p2align        4, 0x90
        .type   "japi1_map!_1084",@function
"japi1_map!_1084":                      # @"japi1_map!_1084"
        .cfi_startproc
# %bb.0:                                # %top
        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset rbp, -16
        mov     rbp, rsp
        .cfi_def_cfa_register rbp
        push    rsi
        push    rdi
        push    rax
        .cfi_offset rdi, -32
        .cfi_offset rsi, -24
        mov     qword ptr [rbp - 24], rdx
        mov     rax, qword ptr [rdx + 8]
        mov     rcx, qword ptr [rax + 8]
        test    rcx, rcx
        je      .LBB0_7
# %bb.1:                                # %L24
        mov     rsi, qword ptr [rdx + 16]
        mov     r8, qword ptr [rsi + 8]
        test    r8, r8
        je      .LBB0_7
# %bb.2:                                # %L24
        mov     r10, qword ptr [rdx + 24]
        mov     rdx, qword ptr [r10 + 8]
        test    rdx, rdx
        je      .LBB0_7
# %bb.3:                                # %L84.preheader
        mov     r9, qword ptr [rsi]
        mov     r10, qword ptr [r10]
        mov     r11, qword ptr [rax]
        dec     rdx
        dec     r8
        dec     rcx
        xor     esi, esi
        .p2align        4, 0x90
.LBB0_4:                                # %L84
                                        # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [r9 + 8*rsi]    # xmm0 = mem[0],zero
        vaddsd  xmm0, xmm0, qword ptr [r10 + 8*rsi]
        vmovsd  qword ptr [r11 + 8*rsi], xmm0
        cmp     rcx, rsi
        je      .LBB0_7
# %bb.5:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        cmp     r8, rsi
        je      .LBB0_7
# %bb.6:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        lea     rdi, [rsi + 1]
        cmp     rdx, rsi
        mov     rsi, rdi
        jne     .LBB0_4
.LBB0_7:                                # %L172
        add     rsp, 8
        pop     rdi
        pop     rsi
        pop     rbp
        ret
.Lfunc_end0:
        .size   "japi1_map!_1084", .Lfunc_end0-"japi1_map!_1084"
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

1.8.3 without -Cskylake:

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1)
           @code_native syntax=:intel debuginfo=:none map!(+, C1, A1, B1)
       end
        .text
        .file   "map!"
        .globl  "japi1_map!_88"                 # -- Begin function japi1_map!_88
        .p2align        4, 0x90
        .type   "japi1_map!_88",@function
"japi1_map!_88":                        # @"japi1_map!_88"
        .cfi_startproc
# %bb.0:                                # %top
        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset rbp, -16
        mov     rbp, rsp
        .cfi_def_cfa_register rbp
        push    rsi
        push    rax
        .cfi_offset rsi, -24
        mov     qword ptr [rbp - 16], rdx
        mov     rax, qword ptr [rdx + 8]
        mov     rcx, qword ptr [rax + 8]
        test    rcx, rcx
        je      .LBB0_7
# %bb.1:                                # %L24
        mov     rsi, qword ptr [rdx + 16]
        mov     r8, qword ptr [rsi + 8]
        test    r8, r8
        je      .LBB0_7
# %bb.2:                                # %L24
        mov     r10, qword ptr [rdx + 24]
        mov     rdx, qword ptr [r10 + 8]
        test    rdx, rdx
        je      .LBB0_7
# %bb.3:                                # %L84.preheader
        mov     r9, qword ptr [rsi]
        mov     r10, qword ptr [r10]
        mov     r11, qword ptr [rax]
        add     rdx, -1
        add     r8, -1
        add     rcx, -1
        xor     esi, esi
        .p2align        4, 0x90
.LBB0_4:                                # %L84
                                        # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [r9 + 8*rsi]    # xmm0 = mem[0],zero
        vaddsd  xmm0, xmm0, qword ptr [r10 + 8*rsi]
        cmp     rcx, rsi
        vmovsd  qword ptr [r11 + 8*rsi], xmm0
        je      .LBB0_7
# %bb.5:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        cmp     r8, rsi
        je      .LBB0_7
# %bb.6:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        cmp     rdx, rsi
        lea     rsi, [rsi + 1]
        jne     .LBB0_4
.LBB0_7:                                # %L172
        add     rsp, 8
        pop     rsi
        pop     rbp
        ret
.Lfunc_end0:
        .size   "japi1_map!_88", .Lfunc_end0-"japi1_map!_88"
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

1.8.0 with -Cskylake:

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1)
           @code_native syntax=:intel debuginfo=:none map!(+, C1, A1, B1)
       end
        .text
        .file   "map!"
        .globl  "japi1_map!_84"                 # -- Begin function japi1_map!_84
        .p2align        4, 0x90
        .type   "japi1_map!_84",@function
"japi1_map!_84":                        # @"japi1_map!_84"
        .cfi_startproc
# %bb.0:                                # %top
        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset rbp, -16
        mov     rbp, rsp
        .cfi_def_cfa_register rbp
        push    rsi
        push    rdi
        push    rax
        .cfi_offset rdi, -32
        .cfi_offset rsi, -24
        mov     qword ptr [rbp - 24], rdx
        mov     rax, qword ptr [rdx + 8]
        mov     rcx, qword ptr [rax + 8]
        test    rcx, rcx
        je      .LBB0_7
# %bb.1:                                # %L24
        mov     rsi, qword ptr [rdx + 16]
        mov     r8, qword ptr [rsi + 8]
        test    r8, r8
        je      .LBB0_7
# %bb.2:                                # %L24
        mov     r10, qword ptr [rdx + 24]
        mov     rdx, qword ptr [r10 + 8]
        test    rdx, rdx
        je      .LBB0_7
# %bb.3:                                # %L84.preheader
        mov     r9, qword ptr [rsi]
        mov     r10, qword ptr [r10]
        mov     r11, qword ptr [rax]
        dec     rdx
        dec     r8
        dec     rcx
        xor     esi, esi
        .p2align        4, 0x90
.LBB0_4:                                # %L84
                                        # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [r9 + 8*rsi]    # xmm0 = mem[0],zero
        vaddsd  xmm0, xmm0, qword ptr [r10 + 8*rsi]
        vmovsd  qword ptr [r11 + 8*rsi], xmm0
        cmp     rcx, rsi
        je      .LBB0_7
# %bb.5:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        cmp     r8, rsi
        je      .LBB0_7
# %bb.6:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        lea     rdi, [rsi + 1]
        cmp     rdx, rsi
        mov     rsi, rdi
        jne     .LBB0_4
.LBB0_7:                                # %L172
        add     rsp, 8
        pop     rdi
        pop     rsi
        pop     rbp
        ret
.Lfunc_end0:
        .size   "japi1_map!_84", .Lfunc_end0-"japi1_map!_84"
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

1.8.0 without -Cskylake:

julia> let A1 = rand(10, 1000), B1 = copy(A1), C1 = zero(A1), D1 = zero(A1), E1 = zero(A1)
           @code_native syntax=:intel debuginfo=:none map!(+, C1, A1, B1)
       end
        .text
        .file   "map!"
        .globl  "japi1_map!_84"                 # -- Begin function japi1_map!_84
        .p2align        4, 0x90
        .type   "japi1_map!_84",@function
"japi1_map!_84":                        # @"japi1_map!_84"
        .cfi_startproc
# %bb.0:                                # %top
        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset rbp, -16
        mov     rbp, rsp
        .cfi_def_cfa_register rbp
        push    rsi
        push    rax
        .cfi_offset rsi, -24
        mov     qword ptr [rbp - 16], rdx
        mov     rax, qword ptr [rdx + 8]
        mov     rcx, qword ptr [rax + 8]
        test    rcx, rcx
        je      .LBB0_7
# %bb.1:                                # %L24
        mov     rsi, qword ptr [rdx + 16]
        mov     r8, qword ptr [rsi + 8]
        test    r8, r8
        je      .LBB0_7
# %bb.2:                                # %L24
        mov     r10, qword ptr [rdx + 24]
        mov     rdx, qword ptr [r10 + 8]
        test    rdx, rdx
        je      .LBB0_7
# %bb.3:                                # %L84.preheader
        mov     r9, qword ptr [rsi]
        mov     r10, qword ptr [r10]
        mov     r11, qword ptr [rax]
        add     rdx, -1
        add     r8, -1
        add     rcx, -1
        xor     esi, esi
        .p2align        4, 0x90
.LBB0_4:                                # %L84
                                        # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [r9 + 8*rsi]    # xmm0 = mem[0],zero
        vaddsd  xmm0, xmm0, qword ptr [r10 + 8*rsi]
        cmp     rcx, rsi
        vmovsd  qword ptr [r11 + 8*rsi], xmm0
        je      .LBB0_7
# %bb.5:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        cmp     r8, rsi
        je      .LBB0_7
# %bb.6:                                # %L147
                                        #   in Loop: Header=BB0_4 Depth=1
        cmp     rdx, rsi
        lea     rsi, [rsi + 1]
        jne     .LBB0_4
.LBB0_7:                                # %L172
        add     rsp, 8
        pop     rsi
        pop     rbp
        ret
.Lfunc_end0:
        .size   "japi1_map!_84", .Lfunc_end0-"japi1_map!_84"
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

Okay, that’s surprising. They all look the same and none of them vectorized in any fashion.
I guess -C"skylake" helps the FastBroadcast case, and that 1.9 is now allowing map! to vectorize, hence the perf boost there.

1 Like

Let me know if there are any other experiments you want me to try!