Is there any way one can remove the negative number check for
@fastmath seems to do the trick (by checking the generated
@code_native). However, it seems that, say,
LinearAlgebra.norm() avoids this check when it calls
Is there any way one can remove the negative number check for
NaNMath.jl might be useful
@fastmath should avoid it, but maybe you have a custom number type (like
ForwardDiff.Dual) that hits the fallback, checked version?
what about wrapping the
sqrt function into a non failing alternative?
function mysqrt(x::Real) in_domain = x >= zero(x) Base.sqrt(ifelse(in_domain,x,zero(x)/zero(x)) end
That still has a check, and if you’re doing this for speed then you can do Base.sqrt_llvm(1.0) which is the same (just not exported). And If you’re really doing this for speed them maybe
Bypassing the check is likely to be one assembly instruction as opposed to 9, but I’m unsure why I can’t confirm:
julia> @code_native Base.sqrt_llvm(9.0) # while this works: @code_native sqrt(9.0) ERROR: ArgumentError: argument is not a generic function
Another way to disable the error (and that check) is:
julia> sqrt(Complex(-9.0)) I'm very unclear why I'm seeing $j_throw_complex_domainerror_2023 with with @code_lowered sqrt(Complex(-9.0)) 0.0 + 3.0im
julia> fastsqrt(x) = @fastmath sqrt(x) fastsqrt (generic function with 1 method) julia> fastsqrt(-4) ERROR: DomainError with -4.0:
julia> fastsqrt(x) = @fastmath sqrt(x) fastsqrt (generic function with 1 method) julia> fastsqrt(-4) ERROR: DomainError with -4.0: sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)). Stacktrace:  throw_complex_domainerror(f::Symbol, x::Float64) @ Base.Math ./math.jl:33  sqrt @ ./math.jl:591 [inlined]  sqrt @ ./math.jl:1372 [inlined]  sqrt_fast @ ./fastmath.jl:349 [inlined]  fastsqrt(x::Int64) @ Main ./REPL:1  top-level scope @ REPL:1 julia> fastsqrt(-4.0) NaN julia> fastsqrt(-4.0f0) NaN32
Like I said, weird number types (like
Int) can hit the fallback, checked version. Note the stacktrace; it’s just calling regular
I’d only trust
@fastmath to help with
Only for arrays of BLAS types (
BLAS.nrm2 there (which calls
sqrt internally, presumably using the system math library). Otherwise,
norm calls the ordinary Julia
NaNMath.sqrt, despite seemingly doing more work, is likely to help with performance optimization as well, as its check can be optimized, while
Base.sqrt, throwing an error, cannot.
julia> x = rand(256); y = similar(x); julia> @benchmark @. $y = sqrt($x) BenchmarkTools.Trial: 10000 samples with 227 evaluations. Range (min … max): 334.670 ns … 1.052 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 334.780 ns ┊ GC (median): 0.00% Time (mean ± σ): 336.468 ns ± 10.455 ns ┊ GC (mean ± σ): 0.00% ± 0.00% █ ▃▁ ▁ ▁ ▁ █▄▄▄▃██▇███▇▆▅▄▄▄▅▅▄▅▄▇▅▃▄▅▅▄▄▄▃▃▃▇▁▄▃▄▆▄▅▃▃▅▃▄▄▅▄▅▅▄▄▄▄▅▄▅█ █ 335 ns Histogram: log(frequency) by time 376 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @benchmark @. $y = Base.FastMath.sqrt_fast($x) BenchmarkTools.Trial: 10000 samples with 755 evaluations. Range (min … max): 167.358 ns … 432.384 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 167.364 ns ┊ GC (median): 0.00% Time (mean ± σ): 168.638 ns ± 5.218 ns ┊ GC (mean ± σ): 0.00% ± 0.00% █ ▄▂ ▂ ▁ █▆▁███▇▇▆▅█▆▆▆▆▆▅▆▆▄▅▅█▇▆▅▅▆▅▅▅▅▅▅▅▆▆▆▅▅▆▅▆▄▅▅▅▄▆▅▅▅▅▅▅▅▅▅▅█▆ █ 167 ns Histogram: log(frequency) by time 188 ns < Memory estimate: 0 bytes, allocs estimate: 0. julia> @benchmark @. $y = NaNMath.sqrt($x) BenchmarkTools.Trial: 10000 samples with 755 evaluations. Range (min … max): 167.358 ns … 526.793 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 167.366 ns ┊ GC (median): 0.00% Time (mean ± σ): 168.457 ns ± 5.773 ns ┊ GC (mean ± σ): 0.00% ± 0.00% █ ▄▂ ▁ ▁ █▆▁████▇▆▅█▆▆▅▆▆▅▆▅▅▆▅█▇▅▅▅▅▅▅▅▅▆▄▆▄▅▅▅▅▅▅▅▅▄▄▄▄▅▅▄▃▄▅▄▅▆▆▅█▇ █ 167 ns Histogram: log(frequency) by time 188 ns < Memory estimate: 0 bytes, allocs estimate: 0.
This was done on an Intel CPUs, and to date, all Intel CPUs can only do 128 bits of
sqrt per clock cycle, so SIMD only gives us a 2x speedup.
That is, using YMM registers takes twice as long as using XMM registers, although operating on only a single element of an XMM register is the same speed,
Note that this doesn’t hold for AMD Ryzen. For Zen2 and Zen3 in particular, it takes the same time regardless of 1,2, or 4 elements, meaning the vectorized form should be 4x (rather than 2x) faster. It is much slower than Intel for 1 or 2 elements, however.
So in that case, I’d suggest going with
NaNMath; it avoids throwing an error, and it is the maybe throwing an error that is expensive. Not the check, which is essentially free compared to the expensive
sqrt, and the check can be vectorized.
Note that because it is implemented as
sqrt(x::Real) = x < 0.0 ? NaN : Base.sqrt(x)
it’ll return a
Float64 unconditionally when
< 0.0, so
NaNMath.sqrt(4f0) will be type unstable.
ForwardDiff.Dual overloads it though, so it’ll be type stable with
ForwardDiff.Duals, which is the primary special number type I care about.
Wait, are you sure
@fastmath sqrt doesn’t do this already? On 1.9 I’m seeing:
f(x) = @fastmath sqrt(x) @code_native debuginfo=:none f(-1.) .text .file "f" .globl julia_f_317 # -- Begin function julia_f_317 .p2align 4, 0x90 .type julia_f_317,@function julia_f_317: # @julia_f_317 .cfi_startproc # %bb.0: # %top vsqrtsd %xmm0, %xmm0, %xmm0 retq .Lfunc_end0: .size julia_f_317, .Lfunc_end0-julia_f_317 .cfi_endproc # -- End function .section ".note.GNU-stack","",@progbits
What’re you asking?
I was saying it gets rid of the errors for
Float32, but otherwise it’ll hit
Base.sqrt as a fallback.
NaNMath was equally fast, because it got rid of the error, even though LLVM failed to get rid of the check
vxorpd xmm1, xmm1, xmm1 vucomisd xmm1, xmm0 ja .LBB0_1 # %bb.2: # %select.true.sink vsqrtsd xmm0, xmm0, xmm0 ret .LBB0_1: movabs rax, offset .LCPI0_0 vmovsd xmm0, qword ptr [rax] # xmm0 = mem,zero ret
Thank you everyone, this are all good suggestions!
Indeed, the issue was that I am calling
Int64s. I was not aware that this is considered a “weird type” - are there any other pathologies I should be aware of? With root-taking, perhaps the correct way to go about this is to manually convert to a floating type and then call
Just out of curiosity, how do the check implementations differ so that one is optimizable and one is not?
Interesting! I expected that since there’s instruction level root taking, it wouldn’t matter if it is operating on 32 or 64 bits, but I’m seeing a significant speed boost in with 32. I am working on relatively small integers which should comfortably fit in
Float32’s range, so that might be the best way to go!
Also interesting! Is it indeed a 128 bits limitation, or a 2 operands limitation? i.e. would it be able to operate on four
You can use
Cthulhu.@descend to look at the typed Julia code, LLVM IR, and assembly.
Ints are for things like indexing arrays, not numerical computation IMO.
The integer has to be converted to a floating point number first before getting the root, anyway. It must also be converted before operating on it with other floating point numbers.
Other possible issues are things like overflow: The speed of light is an integer. Why should we care?
Float64 gives you vsqrtsd and Float32 and Float16 vsqrtss which is faster, but on some hypothetical CPU (or some historical RISC or current have sqrt approximation only?) you might not have sqrt in hardware (GPUs?), or say only for Float16, and then this might help:
julia> Float64(Base.sqrt_llvm(Float16(10000.0))) # 1000000.0 will give you an Inf. 100.0
There might be some trick is software (but unlikely to beat hardware) like in this case if you can go away with low precision:
If you work with logarithmic numbers then * simplifies to +, and power or square root should be * (and /, which is really * for constant root)?
The problem is + and - get slower…
It’s on you to make sure you don’t put in negative number, I’m not sure you can rely on NaN given, it might be CPU specific (“implementation defined”? At least when no CPU instruction, I’m not sure what most likely algorithm would do)?
julia> Base.sqrt_llvm(-10000.0) NaN
There are hardware fast inverse square root instructions that are also inaccurate, but faster than this. With
@fastmath, the compiler will generate them for you + add a Newton Raphson iteration to improve accuracy.
julia> function fastinvsqrt!(y,x) @inbounds @fastmath for i = eachindex(y,x) y[i] = 1/sqrt(x[i]) end end fastinvsqrt! (generic function with 1 method) julia> @code_native debuginfo = :none syntax = :intel fastinvsqrt!(Float32, Float32)
So to improve accuracy (only one iteration, which doubles number of correct bits?), but not to full? How would you just get this instruction which is vrsqrtps (and vrsqrtss), i.e. lowest accuracy (if ok for you)?
@code_native give me is a lot longer than just that one or two instructions… but the core part isn’t that long basically 7, while it seemingly could be only the first one:
rsqrt(x) = @fastmath 1/sqrt(x) @code_native rsqrt(10f0) # it's a lot longer if not called with Float32 vrsqrtss %xmm0, %xmm0, %xmm1 vmulss %xmm1, %xmm0, %xmm0 movabsq $.LCPI0_0, %rax vfmadd213ss (%rax), %xmm1, %xmm0 # xmm0 = (xmm1 * xmm0) + mem movabsq $.LCPI0_1, %rax vmulss (%rax), %xmm1, %xmm1 vmulss %xmm0, %xmm1, %xmm0 retq
The relative error for this approximation is:
|Relative Error| ≤ 1.5 ∗ 2^−12
unclear this applies, googling for those instructions I found RSQRTSS and I n’t sure about the -ps ending.
vrsqrtps executes once-per-cycle throughput on Skylake / Icelake.
You can always compute newton’s method afterwards to improve the accuracy. But getting the maximum accuracy from one cycle is probably best.
See page 625, i.e. section 18.25.1 for a table.
RSQRT14PS + 1 Newton-Raphson iteration is actually more accurate (23 bits) than sqrt + div, making it another case where
@fastmath increases accuracy; the transform speeds up code and makes it more accurate, but changing results isn’t allowed, which is why it is forbidden by default.
Although this is for the AVX512 version, which is a little bit more accurate than
vrsqrtss, and also exists in double precision flavors (which have the same precision as the single precision versions, only they operate on
The “ps” ending meeans “packed single”, i.e. packed (vector) of single-precision.
“ss” means “single single” (operating on just 1 Float32)
“sd” means “single double” (operating on just 1 Float64)
“ps” means “packed single” (operating on a vector of Float32)
“pd” means “packed double” (operating on a vector of Float64)
For me, it is also heavily unrolled.
But yes, it could just be the reciprical square root. The rest is to fix the accuracy.
llvmcall it. You’re welcome to make a PR to VectorizationBase.
You could basically just copy
inv_fast code there and make a
approx_inv_sqrt, and an
inv_sqrt that adds the NR iterations.
That’s very intriguing (what’s the other case?) likely why the (2022) Posit standard requires also e.g. rSqrt(posit). I’m not sure if IEEE does too by now (I doubt, maybe only recommends having it?). At least either way it seems like Julia should have such a function, since not a lot of people would know this trivia, and it could use that exact same
@fastmath code and be documented from e.g. sqrt (and cbrt too? for similar, though I doubt CPUs have hardware support). Then I guess for that function
@fastmath could be implemented to do even faster skipping the single Newton-Raphson iteration, and would have 23/2 = 11 correct bits.
We generally target our
@fastmath at much higher accuracy than that.
Ok, since it would be a function, it could be a keyword arg.