This is an LLVM question that came up in the process of trying to make `min`/`max` faster. In that discussion, this other PR was brought up, where @N5N3 suggests the following implementation of `min`:
function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)  # if x - y is negative, x is the smaller argument
    anynan = isnan(x) | isnan(y)
    ifelse(anynan, diff, min)          # if either argument is NaN, so is diff; return it
end
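The trick is that the sign bit of `x - y` tells you which argument is smaller, while a NaN in either argument makes `diff` NaN so it can be returned directly. A few sanity checks of the intended semantics (my own, not from the PR):

@assert __min(1.0, 2.0) === 1.0    # x - y < 0, so the sign bit selects x
@assert __min(3.0, 2.0) === 2.0    # x - y > 0, so the sign bit selects y
@assert __min(0.0, -0.0) === -0.0  # the sign of the difference also covers signed zeros
@assert isnan(__min(NaN, 1.0))     # a NaN argument makes diff NaN, which is returned
@assert isnan(__min(1.0, NaN))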
For `Float64` arguments, the `code_llvm` output is
define double @julia___min_8974(double %0, double %1) #0 {
top:
  %2 = fsub double %0, %1
  %3 = bitcast double %2 to i64
  %4 = icmp sgt i64 %3, -1
  %5 = select i1 %4, double %1, double %0
  %6 = fcmp ord double %0, %1
  %7 = select i1 %6, double %5, double %2
  ret double %7
}
and the x86 assembly is
        vsubsd  %xmm1, %xmm0, %xmm2
        vmovq   %xmm2, %rax
        vmovapd %xmm1, %xmm3
        testq   %rax, %rax
        jns     .LBB0_2
        vmovapd %xmm0, %xmm3
.LBB0_2:
        vcmpordsd       %xmm1, %xmm0, %xmm0
        vblendvpd       %xmm0, %xmm3, %xmm2, %xmm0
        retq
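(For reference, both listings above can be reproduced at the REPL with something along these lines; `debuginfo=:none` just suppresses the source-line comments.)

using InteractiveUtils  # loaded automatically in the REPL; needed in scripts

@code_llvm debuginfo=:none __min(1.0, 2.0)    # the LLVM IR above
@code_native debuginfo=:none __min(1.0, 2.0)  # the scalar x86 assembly above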
However, the core loop of `code_native(broadcast!, Tuple{typeof(__min),Vector{Float64},Vector{Float64},Vector{Float64}})` is
.LBB0_20:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovupd (%rax,%rdi,8), %ymm0
        vmovupd 32(%rax,%rdi,8), %ymm1
        vmovupd (%rcx,%rdi,8), %ymm2
        vmovupd 32(%rcx,%rdi,8), %ymm3
        vsubpd  %ymm2, %ymm0, %ymm4
        vsubpd  %ymm3, %ymm1, %ymm5
        vblendvpd       %ymm4, %ymm0, %ymm2, %ymm6
        vblendvpd       %ymm5, %ymm1, %ymm3, %ymm7
        vcmpordpd       %ymm2, %ymm0, %ymm0
        vcmpordpd       %ymm3, %ymm1, %ymm1
        vblendvpd       %ymm0, %ymm6, %ymm4, %ymm0
        vblendvpd       %ymm1, %ymm7, %ymm5, %ymm1
        vmovupd %ymm0, (%rdx,%rdi,8)
        vmovupd %ymm1, 32(%rdx,%rdi,8)
        addq    $8, %rdi
        cmpq    %rdi, %rsi
        jne     .LBB0_20
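The loop above is the kernel you get from the `code_native` call quoted earlier. To relate this back to the original goal of making `min`/`max` faster, a quick timing comparison against `Base.min` could look like the following (my own benchmark sketch, not from the linked discussion):

using BenchmarkTools

x, y = rand(10^6), rand(10^6)
dest = similar(x)

@btime broadcast!(__min, $dest, $x, $y)  # compiles to the vectorized kernel above
@btime broadcast!(min, $dest, $x, $y)    # Base.min, for comparison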
Note that the scalar assembly uses a `jns` branch instruction and several surrounding `mov` instructions, while the `broadcast!` version replaces all of these with a single `blendv`. In total, the scalar version needs 8 instructions to perform the min, whereas the `broadcast!` version, with the `vblendvpd` substitution, needs only 4 per SIMD chunk for the actual computation (and, of course, operates on vector registers).
The LLVM IR for the same segment is identical to the scalar IR except that it is vectorized (and unrolled by a factor of two), operating on `<4 x double>` values:
  %62 = fsub <4 x double> %wide.load, %wide.load59
  %63 = fsub <4 x double> %wide.load58, %wide.load60
  %64 = bitcast <4 x double> %62 to <4 x i64>
  %65 = bitcast <4 x double> %63 to <4 x i64>
  %66 = icmp sgt <4 x i64> %64, <i64 -1, i64 -1, i64 -1, i64 -1>
  %67 = icmp sgt <4 x i64> %65, <i64 -1, i64 -1, i64 -1, i64 -1>
  %68 = select <4 x i1> %66, <4 x double> %wide.load59, <4 x double> %wide.load
  %69 = select <4 x i1> %67, <4 x double> %wide.load60, <4 x double> %wide.load58
  %70 = fcmp ord <4 x double> %wide.load, %wide.load59
  %71 = fcmp ord <4 x double> %wide.load58, %wide.load60
  %72 = select <4 x i1> %70, <4 x double> %68, <4 x double> %62
  %73 = select <4 x i1> %71, <4 x double> %69, <4 x double> %63
The scalar version compiles to a branch while the vector version is branchless (as vector code must be). Notice that even the scalar version uses one `blendv`, just not the second one we'd expect (hope for).
Does anyone have insight into why LLVM compiles to the branching version in the scalar case? Am I wrong to think the branching version is inferior, given that it uses those 5 instructions (including a jump) where 1 would suffice? Obviously LLVM can figure out the branchless version, because it does exactly that for the vectorized code. I know vector instructions can be slightly more expensive (although I doubt that compares to the cost of a branch misprediction), but then why isn't it using a branch in place of the second `blendv` as well?
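Purely as an experiment (my own sketch, not something from the PR), one could write the first selection with explicit integer masking, so that step contains no `ifelse` at the Julia level, and then compare the scalar `@code_native` against the listing above:

# My own sketch, not from the PR: replace the first ifelse with explicit bit masking
# to see whether the scalar code still picks up a branch.
function __min_masked(x::Float64, y::Float64)
    diff = x - y
    # broadcast the sign bit of diff into a full 64-bit mask (all ones when x - y is negative)
    m = reinterpret(UInt64, reinterpret(Int64, diff) >> 63)
    picked = reinterpret(Float64, (reinterpret(UInt64, x) & m) | (reinterpret(UInt64, y) & ~m))
    anynan = isnan(x) | isnan(y)
    ifelse(anynan, diff, picked)
end

# @code_native debuginfo=:none __min_masked(1.0, 2.0)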
My `versioninfo`:
julia> versioninfo()
Julia Version 1.9.0-DEV.689
Commit 0fce3125b8 (2022-05-31 15:46 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 6 × Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 6 virtual cores