It looks like there isn't really any performance to be gained here:
julia> using BenchmarkTools
julia> x = randn(Float64, 2^10); y = similar(x); # includes negatives
julia> @btime broadcast!(x -> Core.Intrinsics.sqrt_llvm(x), $y, $x); # intrinsic
692.568 ns (0 allocations: 0 bytes)
julia> @btime broadcast!(x -> x < zero(x) ? oftype(x,NaN) : sqrt(x), $y, $x); # NaNMath
692.568 ns (0 allocations: 0 bytes)
julia> x = rand(Float64, 2^10); y = similar(x); # no negatives
julia> @btime broadcast!(sqrt, $y, $x); # Base
1.420 μs (0 allocations: 0 bytes)
Here we see that the NaNMath implementation manages to match the performance of the intrinsic on my machine on this problem, while the Base.sqrt call is 2x slower (when it doesn't throw a domain error), with the dominant effect likely being that the branches prevent SIMD. I'll emphasize that the NaNMath implementation managed to SIMD here, just like the intrinsic. Here I've copied the core loop on x86 for the NaNMath and intrinsic versions:
julia> @code_native debuginfo=:none broadcast!(x -> Core.Intrinsics.sqrt_llvm(x), y, x);
...
.LBB0_13: # %vector.body
# =>This Inner Loop Header: Depth=1
vsqrtpd (%rax,%rsi,8), %ymm0
vmovupd %ymm0, (%rcx,%rsi,8)
addq $4, %rsi
cmpq %rsi, %rdx
jne .LBB0_13
...
julia> @code_native debuginfo=:none broadcast!(x -> x < zero(x) ? oftype(x,NaN) : sqrt(x), y, x);
...
.LBB0_13: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovupd (%rax,%rdi,8), %ymm2
vcmpnltpd %ymm0, %ymm2, %ymm3
vsqrtpd %ymm2, %ymm2
vblendvpd %ymm3, %ymm2, %ymm1, %ymm2
vmovupd %ymm2, (%rcx,%rdi,8)
addq $4, %rdi
cmpq %rdi, %rdx
jne .LBB0_13
...
In the NaNMath version, the SIMD sqrt is executed unconditionally. Then a blend is used to replace the result with NaN if the domain was invalid. This requires an extra 3 instructions (only 2 interesting ones) compared to the intrinsic, but instruction latency or some other effect apparently manages to squash the difference (the two 8 KiB vectors in my benchmark easily fit within my L1 cache, so memory bandwidth is an unlikely culprit).
There's literally a performance cost to these extra instructions, but it's small enough to be unobservable in most contexts (including the above tight loop of only sqrt calls and loads/stores). I don't see value in mucking with internals for this, even though it would be very easy to do.
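As a scalar model of what that vectorized loop does (my own sketch for illustration, not how NaNMath is actually written), the compute-then-blend shape can be written with ifelse, which evaluates both arms unconditionally; that's safe here because the intrinsic never throws:

# Scalar model of the vsqrtpd + vblendvpd pattern: take the sqrt unconditionally
# (Core.Intrinsics.sqrt_llvm does not throw on negative inputs), then select NaN
# for out-of-domain values. blend_sqrt is a name I made up for this sketch.
blend_sqrt(x::Float64) = ifelse(x < 0.0, NaN, Core.Intrinsics.sqrt_llvm(x))

xs = randn(Float64, 2^10); ys = similar(xs)
broadcast!(blend_sqrt, ys, xs) # should vectorize the same way as the NaNMath-style version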
hypot does so much extra stuff (relative to a naive sqrt(abs2(x)+abs2(y))) that the check would be a very small cost even if it weren't removed. But if you check code_llvm(Base.Math._hypot, NTuple{2,Float32}; debuginfo=:none) (or the Float64 version, although it is considerably more complicated) you'll see that a naked sqrt is called with no domain check, because the compiler can tell that the argument is structurally nonnegative. There are many cases where the compiler can avoid domain checks.
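To see that elision in isolation (a toy example of mine, not code from Base), give sqrt an argument the compiler can prove is never less than zero and inspect the IR:

using InteractiveUtils # provides code_llvm outside the REPL

# The argument is a sum of squares, so it can never compare less than zero
# (NaN included), and the x < 0 domain check in Base.sqrt should fold away.
norm2(x, y) = sqrt(x^2 + y^2)

# Expect a bare call to llvm.sqrt.f64 with no branch to the domain-error path.
code_llvm(norm2, NTuple{2,Float64}; debuginfo=:none)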
log is expensive enough (it doesn't have native instructions, at least on most general-purpose hardware) that I don't think there's notable performance to be reclaimed from the domain checks (and you have to introduce the NaNs manually anyway, so it's just a difference of branch-to-error versus a conditional move/branch-to-NaN). But the compiler may skip many of them anyway for the same reason it did in hypot.
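For reference, the branch-to-NaN flavor is easy to spell out by hand (a sketch with a made-up name; NaNMath.log provides this kind of behavior):

# Return NaN instead of throwing a DomainError for negative arguments.
# nanlog is a hypothetical name for this sketch, not an API from any package.
nanlog(x::Float64) = x < 0.0 ? NaN : log(x)

The only difference from Base.log is where that branch goes, to an error or to a NaN; the cost of the check itself is the same either way.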