This is an LLVM question that came up in the process of trying to make `min`/`max` faster. In that discussion, this other PR was brought up, where @N5N3 suggests the following implementation for `min`:
function __min(x, y)
    diff = x - y
    min = ifelse(signbit(diff), x, y)
    anynan = isnan(x)|isnan(y)
    ifelse(anynan, diff, min)
end
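For context, a quick sanity check of how this definition behaves (the inputs below are just illustrative values of mine, not from the PR):

julia> __min(1.0, 2.0)   # diff = -1.0, signbit is true, so x is chosen
1.0

julia> __min(2.0, 1.0)   # diff = 1.0, signbit is false, so y is chosen
1.0

julia> __min(NaN, 1.0)   # anynan is true, so the NaN diff is returned
NaN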
For `Float64` arguments, the `code_llvm` is
define double @julia___min_8974(double %0, double %1) #0 {
top:
%2 = fsub double %0, %1
%3 = bitcast double %2 to i64
%4 = icmp sgt i64 %3, -1
%5 = select i1 %4, double %1, double %0
%6 = fcmp ord double %0, %1
%7 = select i1 %6, double %5, double %2
ret double %7
}
and the x86 assembly is
vsubsd %xmm1, %xmm0, %xmm2
vmovq %xmm2, %rax
vmovapd %xmm1, %xmm3
testq %rax, %rax
jns .LBB0_2
vmovapd %xmm0, %xmm3
.LBB0_2:
vcmpordsd %xmm1, %xmm0, %xmm0
vblendvpd %xmm0, %xmm3, %xmm2, %xmm0
retq
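(For reference, the two listings above can be reproduced with something like the following, where `debuginfo=:none` just drops the comment lines:)

julia> @code_llvm debuginfo=:none __min(1.0, 2.0)

julia> @code_native debuginfo=:none __min(1.0, 2.0)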
However, the core loop of `code_native(broadcast!, Tuple{typeof(__min),Vector{Float64},Vector{Float64},Vector{Float64}})` is
.LBB0_20: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovupd (%rax,%rdi,8), %ymm0
vmovupd 32(%rax,%rdi,8), %ymm1
vmovupd (%rcx,%rdi,8), %ymm2
vmovupd 32(%rcx,%rdi,8), %ymm3
vsubpd %ymm2, %ymm0, %ymm4
vsubpd %ymm3, %ymm1, %ymm5
vblendvpd %ymm4, %ymm0, %ymm2, %ymm6
vblendvpd %ymm5, %ymm1, %ymm3, %ymm7
vcmpordpd %ymm2, %ymm0, %ymm0
vcmpordpd %ymm3, %ymm1, %ymm1
vblendvpd %ymm0, %ymm6, %ymm4, %ymm0
vblendvpd %ymm1, %ymm7, %ymm5, %ymm1
vmovupd %ymm0, (%rdx,%rdi,8)
vmovupd %ymm1, 32(%rdx,%rdi,8)
addq $8, %rdi
cmpq %rdi, %rsi
jne .LBB0_20
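(To be concrete about what is being inspected there: this is the kernel that runs for an in-place broadcast over `Vector{Float64}`s, e.g. with arrays of my own choosing:)

julia> x = rand(10^6); y = rand(10^6); out = similar(x);

julia> out .= __min.(x, y);   # effectively broadcast!(__min, out, x, y), the method shown above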
Note that the scalar assembly uses a `jns` branch instruction and several surrounding `mov` instructions, while the `broadcast!` version replaces all of these with a single `blendv`. In total, the scalar version uses 8 instructions to perform the `min`, whereas the `vblendvpd` substitution in the `broadcast!` version uses only 4 for the actual computation (and is, of course, operating on vector registers).
The LLVM for the same segment is identical to the scalar LLVM except that it is unrolled and uses vector arguments:
%62 = fsub <4 x double> %wide.load, %wide.load59
%63 = fsub <4 x double> %wide.load58, %wide.load60
%64 = bitcast <4 x double> %62 to <4 x i64>
%65 = bitcast <4 x double> %63 to <4 x i64>
%66 = icmp sgt <4 x i64> %64, <i64 -1, i64 -1, i64 -1, i64 -1>
%67 = icmp sgt <4 x i64> %65, <i64 -1, i64 -1, i64 -1, i64 -1>
%68 = select <4 x i1> %66, <4 x double> %wide.load59, <4 x double> %wide.load
%69 = select <4 x i1> %67, <4 x double> %wide.load60, <4 x double> %wide.load58
%70 = fcmp ord <4 x double> %wide.load, %wide.load59
%71 = fcmp ord <4 x double> %wide.load58, %wide.load60
%72 = select <4 x i1> %70, <4 x double> %68, <4 x double> %62
%73 = select <4 x i1> %71, <4 x double> %69, <4 x double> %63
The scalar version compiles to use a branch, while the vector version is branchless (as vector code must be). Notice that even the scalar version uses one `blendv`, just not the second one we'd expect (hope for).

Does anyone have insight into why LLVM compiles the scalar case to the branching version? Am I wrong to think the branching version is inferior, given that it spends 5 instructions (including a jump) where 1 would suffice? Obviously LLVM can figure out the branchless version, because it does so for the vectorized code. I know vector instructions can be slightly more expensive (although I doubt that compares to the risk of a branch misprediction), but then why isn't it using a branch in place of the second `blendv` as well?
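For what it's worth, my rough idea for checking whether the branch actually hurts (my own sketch using BenchmarkTools, not something from the linked PR) is to drive the scalar method with data where `signbit(x - y)` is perfectly predictable versus data where it is essentially random:

using BenchmarkTools

# Sum of __min over element pairs. The serial floating-point accumulation
# prevents SIMD vectorization, so this exercises the scalar lowering
# (assuming LLVM keeps the jns branch after inlining).
function sum_min(x, y)
    s = 0.0
    @inbounds for i in eachindex(x, y)
        s += __min(x[i], y[i])
    end
    return s
end

n = 10^5
a, b = rand(n), rand(n)       # signbit(a - b) is roughly 50/50 and unpredictable
c, d = rand(n), rand(n) .+ 2  # c < d always, so the branch is perfectly predictable

@btime sum_min($a, $b)
@btime sum_min($c, $d)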
My `versioninfo`:
julia> versioninfo()
Julia Version 1.9.0-DEV.689
Commit 0fce3125b8 (2022-05-31 15:46 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 6 × Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 6 virtual cores