Fastest `sqrt` and `log` with negative check

NaNMath.jl implements special functions such as sqrt and log which return NaN if called with a negative argument.
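For example:

julia> using NaNMath

julia> NaNMath.sqrt(-1.0)   # Base.sqrt(-1.0) would throw a DomainError
NaN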

The implementation of sqrt is given by

sqrt(x::T) where {T<:AbstractFloat} = x < 0.0 ? T(NaN) : Base.sqrt(x)

while log is implemented as

for f in (:sin, :cos, :tan, :asin, :acos, :acosh, :atanh, :log, :log2, :log10,
          :lgamma, :log1p)
    @eval begin
        ($f)(x::Float64) = ccall(($(string(f)),libm), Float64, (Float64,), x)
        ($f)(x::Float32) = ccall(($(string(f,"f")),libm), Float32, (Float32,), x)
        ($f)(x::Real) = ($f)(float(x))
    end
end
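For :log, the @eval loop above expands to roughly the following (libm is a library handle defined elsewhere in the package):

log(x::Float64) = ccall(("log", libm), Float64, (Float64,), x)
log(x::Float32) = ccall(("logf", libm), Float32, (Float32,), x)
log(x::Real) = log(float(x))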

which is (in my case) slightly slower than

log(x::T) where {T<:AbstractFloat} = x < 0.0 ? T(NaN) : Base.log(x)

I also tried

function log(x::AbstractFloat)
    if x < 0
        return convert(typeof(x), NaN)
    end
    return Base.log(x)
end

which seems to compile down to the same code as the other implementation using the ternary operator ?:.
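One way to check that claim (a sketch with made-up names) is to compare the IR the two definitions produce:

log_ternary(x::T) where {T<:AbstractFloat} = x < zero(T) ? T(NaN) : Base.log(x)

function log_if(x::AbstractFloat)
    if x < 0
        return convert(typeof(x), NaN)
    end
    return Base.log(x)
end

code_llvm(log_ternary, (Float64,))   # the two dumps should be identical
code_llvm(log_if, (Float64,))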

Any ideas on even more efficient implementations?

I think Core.Intrinsics.sqrt_llvm and Core.Intrinsics.sqrt_llvm_fast (which should be equivalent to @fastmath sqrt) are as fast as you can get: they seem to lower to @llvm.sqrt and then to a single assembly instruction for a Float64.

Never mind, didn’t realise you wanted NaN to be returned if negative. No idea if this is guaranteed to return NaN for negative values.

I don’t love some of the original definitions in any case. The x < 0.0 checks cause unnecessary promotion (e.g., x::Float32 < 0.0::Float64 converts x to Float64 before comparing). I would have written the check as x < zero(T) or x < zero(x) (without a clear opinion as to which is “better”).
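For instance, the sqrt definition rewritten with a width-matched comparison (same semantics, no promotion):

sqrt(x::T) where {T<:AbstractFloat} = x < zero(T) ? T(NaN) : Base.sqrt(x)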

The fact that log/trig functions defer directly to libm seems unnecessary. Even if this were marginally faster (you say it’s slower), this seems somewhat un-Julian. I would have given them analogous definitions to the sqrt one (like you suggested), where they branch to NaN on invalid inputs and otherwise defer to their Base counterparts.

I guess going straight to libm might be convenient (if the semantics are correct, which I assume they are) because it means they don’t have to encode the domains of these functions in these definitions? But a consequence is that none of these log/trig functions look like they’ll work for Float16 (it looks like they’ll stack-overflow on the ::Real method) which seems like a silly oversight.
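A sketch of the Float16 problem: float of a Float16 is still a Float16, so the generated ::Real fallback keeps dispatching to itself:

# float(Float16(2)) isa Float16, so the fallback
log(x::Real) = log(float(x))
# calls log(Float16(2)) → log(Float16(2)) → ... → StackOverflowError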

Interestingly, Core.Intrinsics.sqrt_llvm and Core.Intrinsics.sqrt_llvm_fast both return NaN for negative floating-point arguments

julia> @fastmath sqrt(-42.0)
NaN

julia> Core.Intrinsics.sqrt_llvm(-42.0)
NaN

julia> Core.Intrinsics.sqrt_llvm_fast(-42.0)
NaN

@fastmath sqrt throws an error for negative integers:

julia> @fastmath sqrt(-42)
ERROR: DomainError with -42.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(f::Symbol, x::Float64)
   @ Base.Math ./math.jl:33
 [2] sqrt
   @ ./math.jl:677 [inlined]
 [3] sqrt
   @ ./math.jl:1491 [inlined]
 [4] sqrt_fast(x::Int64)
   @ Base.FastMath ./fastmath.jl:349
 [5] top-level scope
   @ REPL[6]:1

And Core.Intrinsics.sqrt_llvm and Core.Intrinsics.sqrt_llvm_fast return gibberish for integer arguments (both positive and negative):

julia> Core.Intrinsics.sqrt_llvm_fast(42)
2200549683878040164

julia> Core.Intrinsics.sqrt_llvm(42)
2200549683878040164

julia> Core.Intrinsics.sqrt_llvm(-42)
-42

julia> Core.Intrinsics.sqrt_llvm_fast(-42)
-42

I imagine if you pass an int it’s being reinterpreted to a float.

This is indeed what happens:

julia> Core.Intrinsics.sqrt_llvm(42)
2200549683878040164

julia> reinterpret(Int, Core.Intrinsics.sqrt_llvm(reinterpret(Float64, 42)))
2200549683878040164

julia> Core.Intrinsics.sqrt_llvm(-42)
-42

julia> reinterpret(Int, Core.Intrinsics.sqrt_llvm(reinterpret(Float64, -42)))
-42

sqrt (and log) is inherently slow, but it does not involve a test and can be done in one assembly instruction (well, done by the FPU). Maybe log can too on some CPUs, but I think it’s often done in software; sometimes that’s faster than the single instruction…

Branches, at least not-taken/unpredictable ones, are slow and probably hinder SIMD. I think with non-SIMD code it’s actually not slower, with sqrt hiding the latency.

@btime @fastmath log(19000)  # I DO get (only) 3.6% faster for some larger values.

But you might lose that advantage: without that macro I get a function call, and inlining might seem to be the reason it’s faster, but the much larger inlined code might kill performance elsewhere if it doesn’t fit in the L1 cache. Maybe the optimizer is clever and would then NOT inline. Also, does a call always kill SIMD performance? I think so…

julia> @btime @fastmath NaNMath.log(19000)
  13.751 ns (0 allocations: 0 bytes)
9.852194258148577

julia> @btime log(19000)
  1.877 ns (0 allocations: 0 bytes)
9.852194258148577

Intriguing, and same slow performance without the macro…

Possibly throwing isn’t that bad; it needs a check, which at least without SIMD isn’t too costly. Throwing itself is most probably slower, though I think in Julia, as in C++, a throw is no-cost compared to an if when nothing is thrown, and more costly when something is, in both languages.

I would test with SIMD vectors, and while Julia likely checks each number separately, I think you could maybe check all at once whether any number is negative, as a fast path?
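Something like this hedged sketch of that idea (nan_sqrt! is a made-up name):

# Check the whole vector once; if it's all nonnegative, run a branch-free loop.
function nan_sqrt!(out, x)
    if any(<(zero(eltype(x))), x)
        # slow path: some negatives present, substitute NaN element-wise
        @. out = ifelse(x < 0, oftype(x, NaN), sqrt(abs(x)))
    else
        # fast path: abs lets the compiler drop the domain check, so this can SIMD
        @. out = sqrt(abs(x))
    end
    return out
end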

You may want to avoid the check.

something like:

@assert x >= 0
sqrt(x)

That’s likely dangerous (except with an unsigned type), but it got me thinking: why don’t we have unsigned floats? You’re likely just pushing the checks around then, to some earlier cast to the type. But might it be useful? One sign bit seems not too bad of an overhead, but you could do a UFloat8 or UFloat16, and then the extra bit of accuracy you gain might be valuable. Maybe for neural networks, or do all parameters need negative values to be possible? For some other app? All operations could be fast except minus, which would need to check and throw or give a NaN8.

I hope[d] not, it’s slower!

julia> @btime Core.Intrinsics.sqrt_llvm_fast(1.9)
  3.872 ns (0 allocations: 0 bytes)
1.378404875209022

julia> @btime Core.Intrinsics.sqrt_llvm(1.9)
  1.950 ns (0 allocations: 0 bytes)
1.378404875209022

at least on 1.9.4.

julia> @btime @fastmath sqrt(1.9)
  3.361 ns (0 allocations: 0 bytes)
1.378404875209022

julia> @btime @fastmath sqrt(19)
  1.875 ns (0 allocations: 0 bytes)
4.358898943540674

I.e., @fastmath made things slower, and never faster, for what I tested. Maybe this is a regression.

@btime Core.Intrinsics.sqrt_llvm($(Ref(1.9))[]) and @btime Core.Intrinsics.sqrt_llvm_fast($(Ref(1.9))[]) are the same speed for me. The assembly is the same as well so probably just a benchmarking problem.

@fastmath only applies to floating point operations, which is why it is a no-op in @fastmath(sqrt(-42)) and an error is still thrown as if it were called without @fastmath.

But you should absolutely not use @fastmath sqrt(x) with the intent to return NaN for this operation. This is because @fastmath enables the “no nans” optimization flag. This means that an input or output NaN results in undefined behavior. So, really, you should never use @fastmath sqrt(x) unless you know x >= 0.

This example is a tad contrived, but it’s a minimal demonstration of the issue. We have a function that wants to return @fastmath sqrt(x) if the result is non-NaN and some fallback value y otherwise (e.g., if the input is negative or NaN).

julia> function sqrt_ifnotnan(x, y)
               r = @fastmath sqrt(x)
               return isnan(r) ? y : r
       end
sqrt_ifnotnan (generic function with 1 method)

julia> @code_llvm debuginfo=:none sqrt_ifnotnan(-3.0,-1.0)
define double @julia_sqrt_ifnotnan_545(double %0, double %1) #0 {
top:
  %2 = call fast double @llvm.sqrt.f64(double %0)
    ; where is the code to return y?
  ret double %2
}

julia> sqrt_ifnotnan(-3.0,-1.0)
NaN # wasn't this supposed to return -1.0 instead of NaN?

This may look overly contrived, but the no-nans flag is a battleship-sized footgun. It will burn you sooner or later in some absurd situation unless you are extraordinarily careful.

llvm.sqrt documents the following:

Return the same value as a corresponding libm ‘sqrt’ function but without trapping or setting errno. For types specified by IEEE-754, the result matches a conforming libm implementation.

It’s hard to find authoritative-looking documentation of libm semantics, but a few things I’ve found do seem to suggest that negative values are indeed defined to return NaN (and also raise a floating point exception, although LLVM makes the amendment that no exception is raised) rather than being undefined.

This means that Core.Intrinsics.sqrt_llvm is probably valid for a return-NaN-on-negative semantic even though @fastmath(sqrt(x)) is not. But I’m always wary of calling intrinsics.

Note that you aren’t interpolating inputs, so don’t put full faith in the benchmarking results. Intrinsics behave funny in ways that normal functions usually don’t. It’s better to define sqrt_from_llvm(x) = Core.Intrinsics.sqrt_llvm(x).
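For instance (sqrt_from_llvm being that wrapper):

using BenchmarkTools
sqrt_from_llvm(x) = Core.Intrinsics.sqrt_llvm(x)
@btime sqrt_from_llvm($(Ref(1.9))[])   # Ref interpolation defeats constant folding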

I’ll remark that sqrt is already a single native instruction that’s pretty fast on most hardware. @fastmath doesn’t do anything faster (except skip the x < 0 check) unless there is a faster version of sqrt available with reduced accuracy (not on most systems).

Note that skipping the error check isn’t that hard. For example, sqrt(abs(x)) knows that abs(x) < 0 is impossible, so it doesn’t check (and abs is cheap). So a code snippet like ifelse(x < zero(x), oftype(x,NaN), sqrt(abs(x))) won’t have any code for throwing. As scalar code it still held a branch, but if vectorized it might eliminate that. Or sqrt(ifelse(x == abs(x), abs(x), oftype(x,NaN))) had neither branches nor errors when I checked. Although either of these does add a bit more logic and changes the return value for x = -0.0 to +0.0.
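A quick REPL check of the first suggested snippet (nansqrt is a made-up name; the generated code for sqrt(abs(x)) appears later in the thread):

julia> nansqrt(x) = ifelse(x < zero(x), oftype(x, NaN), sqrt(abs(x)))
nansqrt (generic function with 1 method)

julia> nansqrt(-4.0), nansqrt(4.0), nansqrt(-0.0)   # note -0.0 maps to +0.0
(NaN, 2.0, 0.0)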

You want to generate ONE assembly instruction. I didn’t confirm if it returns NaN for negative on x86 or an exception, but on ARM (vector instruction):

https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/FSQRT--vector---Floating-point-Square-Root--vector--

This instruction can generate a floating-point exception. Depending on the settings in FPCR, the exception results in either a flag being set in FPSR or a synchronous exception being generated. For more information, see Floating-point exception traps.
[…]
It has encodings from 2 classes: Half-precision and Single-precision and double-precision

Those are also the same for me (or if not, the difference is benchmarking noise).

How did you see the code? These do not work:

julia> @code_native Core.Intrinsics.sqrt_llvm_fast((Ref(1.9))[])
ERROR: ArgumentError: argument is not a generic function

julia> @code_native Core.Intrinsics.sqrt_llvm(1.9)  # this one IS faster but probably since optimized to a constant...
ERROR: ArgumentError: argument is not a generic function

I was kind of expecting a throw on negative, but the domain error does disappear, and I DID get NaN for @fastmath sqrt(-1.0) and seemingly any negative value. You’re saying it’s undefined and not to be trusted, but maybe it can/will never change?

No, not even this skips the check:

x = -1.0; @code_native sqrt(@inline abs(x))

but it should. Neither does the former, nor should it necessarily, and neither would give you a NaN.

As I warned, intrinsics act funny and should be wrapped in normal functions:

julia> code_native(x->Core.Intrinsics.sqrt_llvm(x), (Float64,); debuginfo=:none)
        .text
        .file   "#77"
        .globl  "julia_#77_613"                 # -- Begin function julia_#77_613
        .p2align        4, 0x90
        .type   "julia_#77_613",@function
"julia_#77_613":                        # @"julia_#77_613"
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vsqrtsd %xmm0, %xmm0, %xmm0
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.Lfunc_end0:
        .size   "julia_#77_613", .Lfunc_end0-"julia_#77_613"
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

Note that this @code_native is just @code_native sqrt(x): the inner abs(x) is evaluated first and only the outermost call is reflected on, so you only see the code for sqrt itself. The macro requires care with nested function calls. I prefer to use the function form of code_native for inline-defined functions:

julia> @code_native debuginfo=:none (x->sqrt(abs(x)))(1.0) # macro form

julia> code_native(x->sqrt(abs(x)), (Float64,); debuginfo=:none) # function form

# but the code_llvm is clearer in this case:

julia> code_llvm(x->sqrt(x), (Float64,); debuginfo=:none)
define double @"julia_#87_627"(double %0) #0 {
top:
  %1 = fcmp uge double %0, 0.000000e+00
  br i1 %1, label %L5, label %L3

L3:                                               ; preds = %top
  call void @j_throw_complex_domainerror_629({}* inttoptr (i64 140632027094512 to {}*), double %0) #2
  unreachable

L5:                                               ; preds = %top
  %2 = call double @llvm.sqrt.f64(double %0)
  ret double %2
}

julia> code_llvm(x->sqrt(abs(x)), (Float64,); debuginfo=:none)
define double @"julia_#89_630"(double %0) #0 {
top:
  %1 = call double @llvm.fabs.f64(double %0)
  %2 = call double @llvm.sqrt.f64(double %1)
  ret double %2
}

I’m saying that @fastmath is allowed to emit undefined behavior when values are NaN. Even if the result of @fastmath sqrt(-1.0) is actually properly defined to be NaN, using that value in any subsequent operation causes undefined behavior because @fastmath has tagged this result as “no-nan”. See the example in my above post where an isnan check was dropped following a @fastmath sqrt. No-nan and no-inf are (in my opinion) usually the most dangerous @fastmath flags and were the main motivators for me opening this issue.

I don’t think this is a concern in this case, as the promotion can be done at compile time. Checking with @code_llvm shows that the literal is converted to the appropriate type. So perhaps the “cleanest” way to write it would simply be x < 0.
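For example (a sketch; the exact symbol names will differ), comparing a Float32 against the literal 0.0 compiles to a single single-precision compare:

julia> code_llvm(x -> x < 0.0, (Float32,); debuginfo=:none)
define i8 @"julia_#1_000"(float %0) #0 {
top:
  %1 = fcmp olt float %0, 0.000000e+00
  %2 = zext i1 %1 to i8
  ret i8 %2
}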

Hmmm. It looks like you’re mostly right. For the BitInteger and IEEEFloat types I tried, the 0.0 gets “demoted” by LLVM to a suitable type. Although Rational{Int} still results in a mixed-type comparison; comparing it to 0 works well.

BigInt and BigFloat give different code with 0 and zero(x), but it’s too messy for me to bother to make sense of. It’s possible that 0 is actually more efficient for them.

Although I would still encourage people to be pedantic and use zero(x) or at least false (the narrowest Real zero). LLVM catches most of the Base types with just 0 but the package ecosystem is a wide world.

As you show, this code Core.Intrinsics.sqrt_llvm(x) is better: it gives you one instruction, no check. But does it work for SIMD? (It likely doesn’t work for non-machine types, e.g. Posits, nor for integers, though Julia should likely provide a fallback for them.) I think not, so what’s the best fast alternative?

I confirm the abs trick works and is likely fast, also for SIMD:

code_native(x->sqrt(@inline abs(x)), (Float64,); debuginfo=:none)  # also works down to Float16, and likely Posits

but it does not give you NaN (or NaR for Posits) for negative numbers, so you must be careful. Rather, it gives you what you “ask” for, which in that case is wrong.

I looked at hypot, which implies a square root, and it just uses sqrt, not sqrt_llvm, so I’m not sure it’s as fast as it could be.

In converting a float value to a posit value, all forms of infinity and NaN convert to NaR.

Thanks for the Cult of Posits link!

It looks like there isn’t really performance to be improved here:

julia> using BenchmarkTools

julia> x = randn(Float64, 2^10); y = similar(x); # includes negatives

julia> @btime broadcast!(x -> Core.Intrinsics.sqrt_llvm(x), $y, $x); # intrinsic
  692.568 ns (0 allocations: 0 bytes)

julia> @btime broadcast!(x -> x < zero(x) ? oftype(x,NaN) : sqrt(x), $y, $x); # NaNMath
  692.568 ns (0 allocations: 0 bytes)

julia> x = rand(Float64, 2^10); y = similar(x); # no negatives

julia> @btime broadcast!(sqrt, $y, $x); # Base
  1.420 μs (0 allocations: 0 bytes)

Here we see that the NaNMath implementation manages to match the performance of the intrinsic on my machine on this problem. We see that the Base.sqrt call is 2x slower (when it doesn’t throw a domain error), with the dominant effect likely being that the branches prevent SIMD.

I’ll emphasize that the NaNMath implementation managed to SIMD here, just like the intrinsic. Here I’ve copied the core loop on x86 for the NaNMath and intrinsic versions:

julia> @code_native debuginfo=:none broadcast!(x -> Core.Intrinsics.sqrt_llvm(x), y, x);
...
.LBB0_13:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vsqrtpd (%rax,%rsi,8), %ymm0
        vmovupd %ymm0, (%rcx,%rsi,8)
        addq    $4, %rsi
        cmpq    %rsi, %rdx
        jne     .LBB0_13
...


julia> @code_native debuginfo=:none broadcast!(x -> x < zero(x) ? oftype(x,NaN) : sqrt(x), y, x);
...
.LBB0_13:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovupd (%rax,%rdi,8), %ymm2
        vcmpnltpd       %ymm0, %ymm2, %ymm3
        vsqrtpd %ymm2, %ymm2
        vblendvpd       %ymm3, %ymm2, %ymm1, %ymm2
        vmovupd %ymm2, (%rcx,%rdi,8)
        addq    $4, %rdi
        cmpq    %rdi, %rdx
        jne     .LBB0_13
...

In the NaNMath version, the SIMD sqrt is executed unconditionally. Then a blend is used to replace the result with NaN if the domain was invalid. This requires an extra 3 instructions (only 2 interesting ones) compared to the intrinsic, but instruction latency or some other effect (the 2 8KiB vectors for my benchmark easily fit within my L1 cache, so memory bandwidth is an unlikely culprit) apparently manages to squash the difference.

There is, strictly speaking, a performance cost to these extra instructions, but it’s small enough to be unobservable in most contexts (including the above tight loop of only sqrt calls and loads/stores). I don’t see value in mucking with internals for this, even though it would be very easy to do.


hypot does so much extra stuff (relative to a naive sqrt(abs2(x)+abs2(y))) that the check would be a very small cost even if it weren’t removed. But if you check the code_llvm(Base.Math._hypot, NTuple{2,Float32}; debuginfo=:none) (or the Float64 version, although it is considerably more complicated) you’ll see that a naked sqrt is called with no domain check because the compiler can tell that the argument is structurally nonnegative. There are many cases where the compiler can avoid domain checks.

log is expensive enough (it doesn’t have native instructions, at least on most general-purpose hardware), that I don’t think there’s notable performance to be reclaimed from the domain checks (and you have to introduce the NaNs manually, anyway, so it’s just a difference of branch-to-error or a conditional move/branch-to-NaN). But the compiler may skip many of them anyway for the same reason as it did in hypot.

We are now continuing with the following implementations:

@inline sqrt(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.sqrt(x)

@inline sqrt(x::Float64) = ccall("llvm.sqrt.f64", llvmcall, Float64, (Float64,), x)
@inline sqrt(x::Float32) = ccall("llvm.sqrt.f32", llvmcall, Float32, (Float32,), x)
@inline sqrt(x::Float16) = ccall("llvm.sqrt.f16", llvmcall, Float16, (Float16,), x)

@inline log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)

@inline log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
@inline log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)
@inline log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)

while one could replace the Float-specialized sqrt methods with the sqrt_llvm mentioned earlier in this thread, as

@inline sqrt(x::Union{Float64, Float32, Float16}) = sqrt_llvm(x)

This way we do not lose performance when calling sqrt and log with floats, which is good enough for our code.
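For reference, a quick sanity check of the Float methods above (this relies on llvm.sqrt/llvm.log following libm semantics and returning NaN for negative inputs):

julia> sqrt(-4.0), log(-4.0)   # the llvmcall methods defined above, not Base's
(NaN, NaN)

julia> sqrt(4.0f0)
2.0f0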
