# Fastest `sqrt` and `log` with negative check

`NaNMath.jl` implements special functions such as `sqrt` and `log` that return `NaN` when called with a negative argument.

The implementation of `sqrt` is given by

```julia
sqrt(x::T) where {T<:AbstractFloat} = x < 0.0 ? T(NaN) : Base.sqrt(x)
```

while `log` is implemented as

```julia
for f in (:sin, :cos, :tan, :asin, :acos, :acosh, :atanh, :log, :log2, :log10,
          :lgamma, :log1p)
    @eval begin
        ($f)(x::Float64) = ccall(($(string(f)), libm), Float64, (Float64,), x)
        ($f)(x::Float32) = ccall(($(string(f, "f")), libm), Float32, (Float32,), x)
        ($f)(x::Real) = ($f)(float(x))
    end
end
```

which is (in my case) (slightly) slower than

```julia
log(x::T) where {T<:AbstractFloat} = x < 0.0 ? T(NaN) : Base.log(x)
```

I also tried

```julia
function log(x::AbstractFloat)
    if x < 0
        return convert(typeof(x), NaN)
    end
    return Base.log(x)
end
```

which seems to compile down to the same code as the other implementation using the ternary operator `?`.

Any ideas on even more efficient implementations?

I think `Core.Intrinsics.sqrt_llvm` and `Core.Intrinsics.sqrt_llvm_fast` (which should be equivalent to `@fastmath sqrt`) are as fast as you can get; they seem to lower to `@llvm.sqrt` and then to a single assembly instruction for `Float64`.

Never mind, didnât realise you wanted NaN to be returned if negative. No idea if this is guaranteed to return NaN for negative values.


I donât love some of the original definitions in any case. The `x < 0.0` checks cause unnecessary promotion (e.g., `x::Float32 < 0.0::Float64` converts `x` to `Float64` before comparing). I would have written the check as `x < zero(T)` or `x < zero(x)` (without a clear opinion as to which is âbetterâ).

The fact that the log/trig functions defer directly to `libm` seems unnecessary. Even if this were marginally faster (you say it's slower), it seems somewhat un-Julian. I would have given them definitions analogous to the `sqrt` one (like you suggested), where they branch to `NaN` on invalid inputs and otherwise defer to their `Base` counterparts.

I guess going straight to `libm` might be convenient (if the semantics are correct, which I assume they are) because it means they don't have to encode the domains of these functions in these definitions? But a consequence is that none of these log/trig functions look like they'll work for `Float16` (it looks like they'll stack-overflow on the `::Real` method), which seems like a silly oversight.
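To illustrate the suspected stack overflow (a hypothetical reconstruction, not NaNMath's actual code): if only `Float64` and `Float32` methods exist, the `::Real` fallback recurses forever for `Float16`, because `float` applied to a `Float16` is a no-op:

```julia
# Hypothetical minimal reproduction of the ::Real fallback pattern.
badlog(x::Float64) = Base.log(x)
badlog(x::Float32) = Base.log(x)
badlog(x::Real) = badlog(float(x))

# float(Float16(2.0)) is still a Float16, so a Float16 argument keeps
# hitting the ::Real method:
# badlog(Float16(2.0))  # StackOverflowError
```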


Interestingly, `Core.Intrinsics.sqrt_llvm` and `Core.Intrinsics.sqrt_llvm_fast` both return `NaN` for negative floating-point arguments:

```julia
julia> @fastmath sqrt(-42.0)
NaN

julia> Core.Intrinsics.sqrt_llvm(-42.0)
NaN

julia> Core.Intrinsics.sqrt_llvm_fast(-42.0)
NaN
```

`@fastmath sqrt` throws an error for negative integers:

```julia
julia> @fastmath sqrt(-42)
ERROR: DomainError with -42.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(f::Symbol, x::Float64)
   @ Base.Math ./math.jl:33
 [2] sqrt
   @ ./math.jl:677 [inlined]
 [3] sqrt
   @ ./math.jl:1491 [inlined]
 [4] sqrt_fast(x::Int64)
   @ Base.FastMath ./fastmath.jl:349
 [5] top-level scope
   @ REPL[6]:1
```

And `Core.Intrinsics.sqrt_llvm` and `Core.Intrinsics.sqrt_llvm_fast` return gibberish for `Int` arguments (both positive and negative):

```julia
julia> Core.Intrinsics.sqrt_llvm_fast(42)
2200549683878040164

julia> Core.Intrinsics.sqrt_llvm(42)
2200549683878040164

julia> Core.Intrinsics.sqrt_llvm(-42)
-42

julia> Core.Intrinsics.sqrt_llvm_fast(-42)
-42
```

I imagine if you pass an integer it's being reinterpreted as a float.

This is indeed what happens:

```julia
julia> Core.Intrinsics.sqrt_llvm(42)
2200549683878040164

julia> reinterpret(Int, Core.Intrinsics.sqrt_llvm(reinterpret(Float64, 42)))
2200549683878040164

julia> Core.Intrinsics.sqrt_llvm(-42)
-42

julia> reinterpret(Int, Core.Intrinsics.sqrt_llvm(reinterpret(Float64, -42)))
-42
```

`sqrt` (and `log`) is inherently slow, but it does not involve a test and can be done in one assembly instruction (well, done by the FPU). Maybe `log` can too on some CPUs, but I think it's done in software; sometimes that's faster than the single instruction…

Branches, at least not-taken/unpredictable ones, are slow, and probably hinder SIMD. I think with non-SIMD it's actually not slower, with `sqrt` hiding its latency.

```julia
@btime @fastmath log(19000)  # I DO get (only) 3.6% faster for some larger values.
```

But it might lose that advantage: without that macro I get a function call, and inlining might seem to be the reason it's faster, but the much larger inlined code might kill performance elsewhere if it doesn't fit in the L1 cache. Maybe the optimizer is clever and would then NOT inline. Also, does a call always kill SIMD performance? I think so…

```julia
julia> @btime @fastmath NaNMath.log(19000)
  13.751 ns (0 allocations: 0 bytes)
9.852194258148577

julia> @btime log(19000)
  1.877 ns (0 allocations: 0 bytes)
9.852194258148577
```

Intriguing, and same slow performance without the macro…

Possibly throwing isn't that bad: it needs a check, which at least without SIMD isn't too costly. Throwing itself is most probably slower; I think in Julia, as in C++, a `throw` is zero-cost versus an `if` when nothing is thrown, and more costly when something is thrown, in both languages.

I would test with SIMD vectors, and while Julia likely checks each number separately, I think you could maybe check all at once whether any number is negative, as a fast path?
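A sketch of that fast-path idea (the function name is hypothetical): scan the vector once for negatives, and only do per-element NaN substitution when one is actually present:

```julia
# Fast path: if no element is negative, the loop body is a bare sqrt that
# the compiler can vectorize freely (abs proves the argument nonnegative,
# removing the domain check). Slow path: branch-free NaN substitution.
function sqrt_fastpath!(out::AbstractVector{T}, x::AbstractVector{T}) where {T<:AbstractFloat}
    if any(v -> v < zero(T), x)
        @. out = ifelse(x < zero(T), T(NaN), sqrt(abs(x)))
    else
        @. out = sqrt(abs(x))
    end
    return out
end
```

The common all-nonnegative case pays only one extra pass over the data, while the semantics (NaN for negatives) match NaNMath's.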

You may want to avoid the check.

something like:

```julia
@assert x >= 0
sqrt(x)
```

Thatâs likely dangerous (except with an unsigned type), but got me thinking why donât we have Unsigned floats? Youâre likely pushing the checks around then, to some earlier cast to it. But might it be useful? One sign bit seems not to bad of an overhead, but you could do a UFloat8 or UFloat16, and then the extra bit of accuracy you gain might be valuable. Maybe for neural networks, or do all parameters require negative too possible? For some other app? All operations could be fast except minus which would need to check and throw or give a NaN8.

I hope[d] not, it's slower!

```julia
julia> @btime Core.Intrinsics.sqrt_llvm_fast(1.9)
  3.872 ns (0 allocations: 0 bytes)
1.378404875209022

julia> @btime Core.Intrinsics.sqrt_llvm(1.9)
  1.950 ns (0 allocations: 0 bytes)
1.378404875209022
```

at least on 1.9.4.

```julia
julia> @btime @fastmath sqrt(1.9)
  3.361 ns (0 allocations: 0 bytes)
1.378404875209022

julia> @btime @fastmath sqrt(19)
  1.875 ns (0 allocations: 0 bytes)
4.358898943540674
```

I.e., `@fastmath` made it slower, and never faster for anything I tested. Maybe this is a regression.

`@btime Core.Intrinsics.sqrt_llvm($(Ref(1.9))[])` and `@btime Core.Intrinsics.sqrt_llvm_fast($(Ref(1.9))[])` are the same speed for me. The assembly is the same as well, so it's probably just a benchmarking problem.


`@fastmath` only applies to floating point operations, which is why it is a no-op in `@fastmath(sqrt(-42))` and an error is still thrown as if it were called without `@fastmath`.

But you should absolutely not use `@fastmath sqrt(x)` with the intent to return NaN for this operation. This is because `@fastmath` enables the "no nans" optimization flag. This means that an input or output NaN results in undefined behavior. So, really, you should never use `@fastmath sqrt(x)` unless you know `x >= 0`.

This example is a tad contrived, but it's a minimal demonstration of the issue. We have a function that wants to return `@fastmath sqrt(x)` if the result is non-NaN and some fallback value `y` otherwise (e.g., if the input is negative or NaN).

```julia
julia> function sqrt_ifnotnan(x, y)
           r = @fastmath sqrt(x)
           return isnan(r) ? y : r
       end
sqrt_ifnotnan (generic function with 1 method)

julia> @code_llvm debuginfo=:none sqrt_ifnotnan(-3.0,-1.0)
define double @julia_sqrt_ifnotnan_545(double %0, double %1) #0 {
top:
  %2 = call fast double @llvm.sqrt.f64(double %0)
  ; where is the code to return y?
  ret double %2
}

julia> sqrt_ifnotnan(-3.0,-1.0)
NaN # wasn't this supposed to return -1.0 instead of NaN?
```

This may look overly-contrived, but the no-nans flag is a battleship-sized footgun. It will burn you sooner or later in some absurd situation unless you are extraordinarily careful.

`llvm.sqrt` documents the following:

> Return the same value as a corresponding libm "`sqrt`" function but without trapping or setting `errno`. For types specified by IEEE-754, the result matches a conforming libm implementation.

Itâs hard to find authoritative-looking documentation of `libm` semantics, but a few things Iâve found do seem to suggest that negative values are indeed defined to return `NaN` (and also raise a floating point exception, although LLVM makes the amendment that no exception is raised) rather than being undefined.

This means that `Core.Intrinsics.sqrt_llvm` is probably valid for a return-NaN-on-negative semantic even though `@fastmath(sqrt(x))` is not. But I'm always wary of calling intrinsics.

Note that you aren't interpolating inputs, so don't give full faith to benchmarking results. Intrinsics behave funny in ways that normal functions usually don't. It's better to define `sqrt_from_llvm(x) = Core.Intrinsics.sqrt_llvm(x)`.

Iâll remark that `sqrt` is already a single native instruction thatâs pretty fast on most hardware. `@fastmath` doesnât do anything faster (except skip the `x < 0` check) unless there is a faster version of `sqrt` available with reduced accuracy (not on most systems).

Note that skipping the error check isn't that hard. For example, `sqrt(abs(x))` knows that `abs(x) < 0` is impossible, so it doesn't check (and `abs` is cheap). So a code snippet like `ifelse(x < zero(x), oftype(x, NaN), sqrt(abs(x)))` won't have any code for throwing. As scalar code it still held a branch, but if vectorized it might eliminate that. Or `sqrt(ifelse(x == abs(x), abs(x), oftype(x, NaN)))` had neither branches nor errors when I checked. Although either of these does add a bit more logic, and changes the return value for `x = -0.0` to `+0.0`.
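Wrapped as functions for experimentation (the names are mine, not Base's), the two branch-free variants read:

```julia
# Variant 1: compute sqrt(abs(x)) unconditionally, blend in NaN for negatives.
sqrt_blend(x::AbstractFloat) = ifelse(x < zero(x), oftype(x, NaN), sqrt(abs(x)))

# Variant 2: feed NaN into sqrt itself for negatives. Base.sqrt(NaN) returns
# NaN (NaN < 0 is false), so no domain error is thrown.
sqrt_nanpass(x::AbstractFloat) = sqrt(ifelse(x == abs(x), abs(x), oftype(x, NaN)))
```

Both return `NaN` for negative inputs without any throwing path, at the cost of mapping `-0.0` to `+0.0`.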


You want to generate ONE assembly instruction. I didn't confirm whether it returns NaN for negative on x86 or an exception, but on ARM (vector instruction):

https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/FSQRT--vector---Floating-point-Square-Root--vector--

> This instruction can generate a floating-point exception. Depending on the settings in FPCR, the exception results in either a flag being set in FPSR or a synchronous exception being generated. For more information, see Floating-point exception traps.
> […]
> It has encodings from 2 classes: Half-precision and Single-precision and double-precision

Those are also the same for me (or if not, a benchmarking-noise difference).

How did you see the code? These do not work:

```julia
julia> @code_native Core.Intrinsics.sqrt_llvm_fast((Ref(1.9))[])
ERROR: ArgumentError: argument is not a generic function

julia> @code_native Core.Intrinsics.sqrt_llvm(1.9)  # this one IS faster but probably since optimized to a constant...
ERROR: ArgumentError: argument is not a generic function
```

I was kind of expecting a throw on negative, but the domain error does disappear, and I DID get NaN for `@fastmath sqrt(-1.0)` and seemingly any negative value. You're saying it's undefined and not to be trusted, but maybe it can/will never change?

No, not even this skips the check:

```julia
x = -1.0; @code_native sqrt(@inline abs(x))
```

but it should. The former doesn't skip it either, nor should it for sure, and neither would give you a NaN.

As I warned, intrinsics act funny and should be wrapped in normal functions:

```julia
julia> code_native(x -> Core.Intrinsics.sqrt_llvm(x), (Float64,); debuginfo=:none)
        .text
        .file   "#77"
        .globl  "julia_#77_613"                 # -- Begin function julia_#77_613
        .p2align        4, 0x90
        .type   "julia_#77_613",@function
"julia_#77_613":                        # @"julia_#77_613"
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vsqrtsd %xmm0, %xmm0, %xmm0
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.Lfunc_end0:
        .size   "julia_#77_613", .Lfunc_end0-"julia_#77_613"
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
```

Note that this `@code_native` is just `@code_native sqrt(x)`. The macro requires care with nested function calls. I prefer to use the function form of `code_native` for inline-defined functions:

```julia
julia> @code_native debuginfo=:none (x -> sqrt(abs(x)))(1.0) # macro form

julia> code_native(x -> sqrt(abs(x)), (Float64,); debuginfo=:none) # function form

# but the code_llvm is clearer in this case:

julia> code_llvm(x -> sqrt(x), (Float64,); debuginfo=:none)
define double @"julia_#87_627"(double %0) #0 {
top:
  %1 = fcmp uge double %0, 0.000000e+00
  br i1 %1, label %L5, label %L3

L3:                                               ; preds = %top
  call void @j_throw_complex_domainerror_629({}* inttoptr (i64 140632027094512 to {}*), double %0) #2
  unreachable

L5:                                               ; preds = %top
  %2 = call double @llvm.sqrt.f64(double %0)
  ret double %2
}

julia> code_llvm(x -> sqrt(abs(x)), (Float64,); debuginfo=:none)
define double @"julia_#89_630"(double %0) #0 {
top:
  %1 = call double @llvm.fabs.f64(double %0)
  %2 = call double @llvm.sqrt.f64(double %1)
  ret double %2
}
```

Iâm saying that `@fastmath` is allowed to emit undefined behavior when values are `NaN`. Even if the result of `@fastmath sqrt(-1.0)` is actually properly defined to be NaN, using that value in any subsequent operation causes undefined behavior because `@fastmath` has tagged this result as âno-nanâ. See the example in my above post where an `isnan` check was dropped following a `@fastmath sqrt`. No-nan and no-inf are (in my opinion) usually the most dangerous `@fastmath` flags and were the main motivators for me opening this issue.


I donât think this is a concern in this case as the promotion can be done at compile time. Checking with `@code_llvm` shows that the literal is converted to the appropriate type. So perhaps the âcleanestâ way to write it would be simply `x < 0`.

Hmmm. It looks like you're mostly right. For the `BitInteger` and `IEEEFloat` types I tried, the `0.0` gets "demoted" by LLVM to a suitable type. Although `Rational{Int}` still results in a mixed-type comparison, comparing it to `0` works well.

`BigInt` and `BigFloat` give different code with `0` and `zero(x)`, but it's too messy for me to bother to make sense of. It's possible that `0` is actually more efficient for them.

Although I would still encourage people to be pedantic and use `zero(x)` or at least `false` (the narrowest `Real` zero). LLVM catches most of the `Base` types with just `0` but the package ecosystem is a wide world.
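A quick illustration of why `false` can serve as the narrowest `Real` zero (it promotes to the other operand's type, so a comparison never widens `x`):

```julia
# Bool promotes to any other numeric type, so x < false stays in x's type.
@assert promote_type(Bool, Float32) === Float32
# A 0.0 literal, by contrast, is a Float64 and would widen a Float32 operand:
@assert promote_type(Float64, Float32) === Float64

# As a bonus, false acts as a "strong zero" in multiplication:
@assert false * NaN === 0.0
```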


As you show, `Core.Intrinsics.sqrt_llvm(x)` is better: it gives you one instruction and no check. But does it work for SIMD (likely neither works for non-machine types, e.g. Posits, nor for integers, though Julia should likely provide a fallback for those)? I think not, so what's the best fast alternative?

I confirm the abs trick works and is likely fast, also for SIMD:

```julia
code_native(x -> sqrt(@inline abs(x)), (Float64,); debuginfo=:none)  # also works down to Float16, and likely Posits
```

but it does not give you NaN (or NaR for Posits) for negative numbers, so you must be careful. Rather, you get what you "asked" for in that case, which is wrong.

I looked at `hypot`, which implies a square root, and it just uses `sqrt`, not `sqrt_llvm`, so I'm not sure it's as fast as it could be.

> In converting a float value to a posit value, all forms of infinity and NaN convert to NaR.


Thanks for the Cult of Posits link!


It looks like there isn't really performance to be improved here:

```julia
julia> using BenchmarkTools

julia> x = randn(Float64, 2^10); y = similar(x); # includes negatives

julia> @btime broadcast!(x -> Core.Intrinsics.sqrt_llvm(x), $y, $x); # intrinsic
  692.568 ns (0 allocations: 0 bytes)

julia> @btime broadcast!(x -> x < zero(x) ? oftype(x, NaN) : sqrt(x), $y, $x); # NaNMath
  692.568 ns (0 allocations: 0 bytes)

julia> x = rand(Float64, 2^10); y = similar(x); # no negatives

julia> @btime broadcast!(sqrt, $y, $x); # Base
  1.420 μs (0 allocations: 0 bytes)
```

Here we see that the `NaNMath` implementation manages to match the performance of the intrinsic on my machine on this problem. We see that the `Base.sqrt` call is 2x slower (when it doesn't throw a domain error), with the dominant effect likely being that the branches prevent SIMD.

Iâll emphasize that the `NaNMath` implementation manged to SIMD here, just like the intrinsic. Here Iâve copied the core loop on x86 for the `NaNMath` and intrinsic versions:

```julia
julia> @code_native debuginfo=:none broadcast!(x -> Core.Intrinsics.sqrt_llvm(x), y, x);
...
.LBB0_13:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vsqrtpd (%rax,%rsi,8), %ymm0
        vmovupd %ymm0, (%rcx,%rsi,8)
        cmpq    %rsi, %rdx
        jne     .LBB0_13
...

julia> @code_native debuginfo=:none broadcast!(x -> x < zero(x) ? oftype(x, NaN) : sqrt(x), y, x);
...
.LBB0_13:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovupd (%rax,%rdi,8), %ymm2
        vcmpnltpd       %ymm0, %ymm2, %ymm3
        vsqrtpd %ymm2, %ymm2
        vblendvpd       %ymm3, %ymm2, %ymm1, %ymm2
        vmovupd %ymm2, (%rcx,%rdi,8)
        cmpq    %rdi, %rdx
        jne     .LBB0_13
...
```

In the `NaNMath` version, the SIMD sqrt is executed unconditionally. Then a blend is used to replace the result with NaN if the domain was invalid. This requires an extra 3 instructions (only 2 interesting ones) compared to the intrinsic, but instruction latency or some other effect (the 2 8KiB vectors for my benchmark easily fit within my L1 cache, so memory bandwidth is an unlikely culprit) apparently manages to squash the difference.

Thereâs literally a performance cost to these extra instructions, but itâs small enough to be unobservable in most contexts (including the above tight loop of only sqrt calls and loads/stores). I donât see value in mucking with internals for this, even though it would be very easy to do.

`hypot` does so much extra stuff (relative to a naive `sqrt(abs2(x) + abs2(y))`) that the check would be a very small cost even if it weren't removed. But if you check the `code_llvm(Base.Math._hypot, NTuple{2,Float32}; debuginfo=:none)` (or the `Float64` version, although it is considerably more complicated), you'll see that a naked `sqrt` is called with no domain check because the compiler can tell that the argument is structurally nonnegative. There are many cases where the compiler can avoid domain checks.

`log` is expensive enough (it doesn't have native instructions, at least on most general-purpose hardware) that I don't think there's notable performance to be reclaimed from the domain checks (and you have to introduce the NaNs manually anyway, so it's just a difference of branch-to-error or a conditional move/branch-to-NaN). But the compiler may skip many of them anyway, for the same reason as it did in `hypot`.


We are continuing now with the following implementations:

```julia
@inline sqrt(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.sqrt(x)

@inline sqrt(x::Float64) = ccall("llvm.sqrt.f64", llvmcall, Float64, (Float64,), x)
@inline sqrt(x::Float32) = ccall("llvm.sqrt.f32", llvmcall, Float32, (Float32,), x)
@inline sqrt(x::Float16) = ccall("llvm.sqrt.f16", llvmcall, Float16, (Float16,), x)

@inline log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)

@inline log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
@inline log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)
@inline log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)
```

while one could replace the `Float`-specialized `sqrt` methods with the `sqrt_llvm` intrinsic mentioned earlier in this thread, as

```julia
@inline sqrt(x::Union{Float64, Float32, Float16}) = sqrt_llvm(x)
```

This way we do not lose performance when calling `sqrt` and `log` with floats, which is good enough for our code.
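For reference, a minimal self-contained sketch of the `::Real` fallbacks (wrapped in a module so the new `sqrt`/`log` don't clash with `Base`'s; the `llvmcall` specializations are omitted, and the module name is hypothetical):

```julia
module SafeMath

# NaN-on-negative fallbacks, mirroring the definitions above.
@inline sqrt(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.sqrt(x)
@inline log(x::Real)  = x < zero(x) ? oftype(x, NaN) : Base.log(x)

end

SafeMath.sqrt(-2.0)  # NaN instead of a DomainError
SafeMath.log(100.0)  # same as Base.log(100.0)
```

Note that, as written, the `oftype(x, NaN)` conversion will fail for negative integer inputs (there is no integer NaN), so the integer path relies on nonnegative arguments.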
