Disabling DomainError (in sqrt())?

Is there any way one can remove the negative number check for sqrt(Float64)? Neither @inbounds or @fastmath seems to do the trick (by checking the generated @code_native). However, it seems that, say, LinearAlgebra.norm() avoids this check when it calls sqrt.

NaNMath.jl might be useful

@fastmath should avoid it, but maybe you have a custom number type (like ForwardDiff.Dual) that hits the fallback, checked version?

what about wrapping the sqrt function into a non failing alternative?

function mysqrt(x::Real)
    in_domain = x >= zero(x)
    Base.sqrt(ifelse(in_domain,x,zero(x)/zero(x))
end

That still has a check, and if you’re doing this for speed then you can do Base.sqrt_llvm(1.0) which is the same (just not exported). And If you’re really doing this for speed them maybe Float64(Base.sqrt_llvm(Float32(1.0)))?

Bypassing the check is likely to be one assembly instruction as opposed to 9, but I’m unsure why I can’t confirm:

julia> @code_native Base.sqrt_llvm(9.0)   # while this works: @code_native sqrt(9.0)
ERROR: ArgumentError: argument is not a generic function

Another way to disable the error (and that check) is:

julia> sqrt(Complex(-9.0))  I'm very unclear why I'm seeing $j_throw_complex_domainerror_2023 with with @code_lowered sqrt(Complex(-9.0))
0.0 + 3.0im
julia> fastsqrt(x) = @fastmath sqrt(x)
fastsqrt (generic function with 1 method)

julia> fastsqrt(-4)
ERROR: DomainError with -4.0:
julia> fastsqrt(x) = @fastmath sqrt(x)
fastsqrt (generic function with 1 method)

julia> fastsqrt(-4)
ERROR: DomainError with -4.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(f::Symbol, x::Float64)
   @ Base.Math ./math.jl:33
 [2] sqrt
   @ ./math.jl:591 [inlined]
 [3] sqrt
   @ ./math.jl:1372 [inlined]
 [4] sqrt_fast
   @ ./fastmath.jl:349 [inlined]
 [5] fastsqrt(x::Int64)
   @ Main ./REPL[1]:1
 [6] top-level scope
   @ REPL[2]:1

julia> fastsqrt(-4.0)
NaN

julia> fastsqrt(-4.0f0)
NaN32

Like I said, weird number types (like Int) can hit the fallback, checked version. Note the stacktrace; it’s just calling regular sqrt.
I’d only trust @fastmath to help with Float64 and Float32.

1 Like

Only for arrays of BLAS types (Float32 and Float64), since norm calls BLAS.nrm2 there (which calls sqrt internally, presumably using the system math library). Otherwise, norm calls the ordinary Julia sqrt function.

I’m guessing that’s all @bremez cares above if this is about performance optimization? If it’s about wanting NaNs, then NaNMath.jl as suggested by @Oscar_Smith seems like the way to go.

NaNMath.sqrt, despite seemingly doing more work, is likely to help with performance optimization as well, as its check can be optimized, while Base.sqrt, throwing an error, cannot.

julia> x = rand(256); y = similar(x);

julia> @benchmark @. $y = sqrt($x)
BenchmarkTools.Trial: 10000 samples with 227 evaluations.
 Range (min … max):  334.670 ns …  1.052 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     334.780 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   336.468 ns ± 10.455 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▃▁ ▁ ▁                                                  ▁
  █▄▄▄▃██▇███▇▆▅▄▄▄▅▅▄▅▄▇▅▃▄▅▅▄▄▄▃▃▃▇▁▄▃▄▆▄▅▃▃▅▃▄▄▅▄▅▅▄▄▄▄▅▄▅█ █
  335 ns        Histogram: log(frequency) by time       376 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark @. $y = Base.FastMath.sqrt_fast($x)
BenchmarkTools.Trial: 10000 samples with 755 evaluations.
 Range (min … max):  167.358 ns … 432.384 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     167.364 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   168.638 ns ±   5.218 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▄▂                                                      ▂  ▁
  █▆▁███▇▇▆▅█▆▆▆▆▆▅▆▆▄▅▅█▇▆▅▅▆▅▅▅▅▅▅▅▆▆▆▅▅▆▅▆▄▅▅▅▄▆▅▅▅▅▅▅▅▅▅▅█▆ █
  167 ns        Histogram: log(frequency) by time        188 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark @. $y = NaNMath.sqrt($x)
BenchmarkTools.Trial: 10000 samples with 755 evaluations.
 Range (min … max):  167.358 ns … 526.793 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     167.366 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   168.457 ns ±   5.773 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▄▂                                                      ▁  ▁
  █▆▁████▇▆▅█▆▆▅▆▆▅▆▅▅▆▅█▇▅▅▅▅▅▅▅▅▆▄▆▄▅▅▅▅▅▅▅▅▄▄▄▄▅▅▄▃▄▅▄▅▆▆▅█▇ █
  167 ns        Histogram: log(frequency) by time        188 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This was done on an Intel CPUs, and to date, all Intel CPUs can only do 128 bits of sqrt per clock cycle, so SIMD only gives us a 2x speedup.
That is, using YMM registers takes twice as long as using XMM registers, although operating on only a single element of an XMM register is the same speed,

Note that this doesn’t hold for AMD Ryzen. For Zen2 and Zen3 in particular, it takes the same time regardless of 1,2, or 4 elements, meaning the vectorized form should be 4x (rather than 2x) faster. It is much slower than Intel for 1 or 2 elements, however.

So in that case, I’d suggest going with NaNMath; it avoids throwing an error, and it is the maybe throwing an error that is expensive. Not the check, which is essentially free compared to the expensive sqrt, and the check can be vectorized.

Note that because it is implemented as

sqrt(x::Real) = x < 0.0 ? NaN : Base.sqrt(x)

it’ll return a Float64 unconditionally when < 0.0, so NaNMath.sqrt(4f0) will be type unstable.
ForwardDiff.Dual overloads it though, so it’ll be type stable with ForwardDiff.Duals, which is the primary special number type I care about.

Wait, are you sure @fastmath sqrt doesn’t do this already? On 1.9 I’m seeing:

f(x) = @fastmath sqrt(x)
@code_native debuginfo=:none f(-1.)
	.text
	.file	"f"
	.globl	julia_f_317                     # -- Begin function julia_f_317
	.p2align	4, 0x90
	.type	julia_f_317,@function
julia_f_317:                            # @julia_f_317
	.cfi_startproc
# %bb.0:                                # %top
	vsqrtsd	%xmm0, %xmm0, %xmm0
	retq
.Lfunc_end0:
	.size	julia_f_317, .Lfunc_end0-julia_f_317
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

What’re you asking?
I was saying it gets rid of the errors for Float64 and Float32, but otherwise it’ll hit Base.sqrt as a fallback.

NaNMath was equally fast, because it got rid of the error, even though LLVM failed to get rid of the check

        vxorpd  xmm1, xmm1, xmm1
        vucomisd        xmm1, xmm0
        ja      .LBB0_1
# %bb.2:                                # %select.true.sink
        vsqrtsd xmm0, xmm0, xmm0
        ret
.LBB0_1:
        movabs  rax, offset .LCPI0_0
        vmovsd  xmm0, qword ptr [rax]           # xmm0 = mem[0],zero
        ret

Thank you everyone, this are all good suggestions!

Indeed, the issue was that I am calling sqrt on Int64s. I was not aware that this is considered a “weird type” - are there any other pathologies I should be aware of? With root-taking, perhaps the correct way to go about this is to manually convert to a floating type and then call @fastmath sqrt()?

Just out of curiosity, how do the check implementations differ so that one is optimizable and one is not?

Interesting! I expected that since there’s instruction level root taking, it wouldn’t matter if it is operating on 32 or 64 bits, but I’m seeing a significant speed boost in with 32. I am working on relatively small integers which should comfortably fit in Float32’s range, so that might be the best way to go!

Also interesting! Is it indeed a 128 bits limitation, or a 2 operands limitation? i.e. would it be able to operate on four Float32s?

Yes, four Float32s.

You can use Cthulhu.@descend to look at the typed Julia code, LLVM IR, and assembly.

Ints are for things like indexing arrays, not numerical computation IMO.
The integer has to be converted to a floating point number first before getting the root, anyway. It must also be converted before operating on it with other floating point numbers.
Other possible issues are things like overflow: The speed of light is an integer. Why should we care?

Float64 gives you vsqrtsd and Float32 and Float16 vsqrtss which is faster, but on some hypothetical CPU (or some historical RISC or current have sqrt approximation only?) you might not have sqrt in hardware (GPUs?), or say only for Float16, and then this might help:

julia> Float64(Base.sqrt_llvm(Float16(10000.0)))  # 1000000.0 will give you an Inf.
100.0

There might be some trick is software (but unlikely to beat hardware) like in this case if you can go away with low precision:

If you work with logarithmic numbers then * simplifies to +, and power or square root should be * (and /, which is really * for constant root)?

The problem is + and - get slower…

It’s on you to make sure you don’t put in negative number, I’m not sure you can rely on NaN given, it might be CPU specific (“implementation defined”? At least when no CPU instruction, I’m not sure what most likely algorithm would do)?

julia> Base.sqrt_llvm(-10000.0)
NaN

There are hardware fast inverse square root instructions that are also inaccurate, but faster than this. With @fastmath, the compiler will generate them for you + add a Newton Raphson iteration to improve accuracy.

julia> function fastinvsqrt!(y,x)
           @inbounds @fastmath for i = eachindex(y,x)
               y[i] = 1/sqrt(x[i])
           end
       end
fastinvsqrt! (generic function with 1 method)

julia> @code_native debuginfo = :none syntax = :intel fastinvsqrt!(Float32[], Float32[])
1 Like

So to improve accuracy (only one iteration, which doubles number of correct bits?), but not to full? How would you just get this instruction which is vrsqrtps (and vrsqrtss), i.e. lowest accuracy (if ok for you)?

What your @code_native give me is a lot longer than just that one or two instructions… but the core part isn’t that long basically 7, while it seemingly could be only the first one:

rsqrt(x) = @fastmath 1/sqrt(x)
@code_native rsqrt(10f0)  # it's a lot longer if not called with Float32

	vrsqrtss	%xmm0, %xmm0, %xmm1
	vmulss	%xmm1, %xmm0, %xmm0
	movabsq	$.LCPI0_0, %rax
	vfmadd213ss	(%rax), %xmm1, %xmm0    # xmm0 = (xmm1 * xmm0) + mem
	movabsq	$.LCPI0_1, %rax
	vmulss	(%rax), %xmm1, %xmm1
	vmulss	%xmm0, %xmm1, %xmm0
	retq

https://www.felixcloutier.com/x86/rsqrtps

The relative error for this approximation is:

|Relative Error| ≤ 1.5 ∗ 2^−12

unclear this applies, googling for those instructions I found RSQRTSS and I n’t sure about the -ps ending.

vrsqrtps executes once-per-cycle throughput on Skylake / Icelake.

You can always compute newton’s method afterwards to improve the accuracy. But getting the maximum accuracy from one cycle is probably best.

https://cdrdv2.intel.com/v1/dl/getContent/671488
See page 625, i.e. section 18.25.1 for a table.
RSQRT14PS + 1 Newton-Raphson iteration is actually more accurate (23 bits) than sqrt + div, making it another case where @fastmath increases accuracy; the transform speeds up code and makes it more accurate, but changing results isn’t allowed, which is why it is forbidden by default.

Although this is for the AVX512 version, which is a little bit more accurate than vrsqrtss, and also exists in double precision flavors (which have the same precision as the single precision versions, only they operate on Float64).

The “ps” ending meeans “packed single”, i.e. packed (vector) of single-precision.
“ss” means “single single” (operating on just 1 Float32)
“sd” means “single double” (operating on just 1 Float64)
“ps” means “packed single” (operating on a vector of Float32)
“pd” means “packed double” (operating on a vector of Float64)

For me, it is also heavily unrolled.
But yes, it could just be the reciprical square root. The rest is to fix the accuracy.

You could llvmcall it. You’re welcome to make a PR to VectorizationBase.

You could basically just copy inv_fast code there and make a approx_inv_sqrt, and an inv_sqrt that adds the NR iterations.

2 Likes

That’s very intriguing (what’s the other case?) likely why the (2022) Posit standard requires also e.g. rSqrt(posit). I’m not sure if IEEE does too by now (I doubt, maybe only recommends having it?). At least either way it seems like Julia should have such a function, since not a lot of people would know this trivia, and it could use that exact same @fastmath code and be documented from e.g. sqrt (and cbrt too? for similar, though I doubt CPUs have hardware support). Then I guess for that function @fastmath could be implemented to do even faster skipping the single Newton-Raphson iteration, and would have 23/2 = 11 correct bits.

We generally target our @fastmath at much higher accuracy than that.

1 Like

Ok, since it would be a function, it could be a keyword arg.