`NaNMath.sqrt`, despite seemingly doing more work, is likely to help with performance as well, as its check can be optimized, while `Base.sqrt`, which may throw an error, cannot.

```julia
julia> x = rand(256); y = similar(x);
julia> @benchmark @. $y = sqrt($x)
BenchmarkTools.Trial: 10000 samples with 227 evaluations.
Range (min … max): 334.670 ns … 1.052 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 334.780 ns ┊ GC (median): 0.00%
Time (mean ± σ): 336.468 ns ± 10.455 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▃▁ ▁ ▁ ▁
█▄▄▄▃██▇███▇▆▅▄▄▄▅▅▄▅▄▇▅▃▄▅▅▄▄▄▃▃▃▇▁▄▃▄▆▄▅▃▃▅▃▄▄▅▄▅▅▄▄▄▄▅▄▅█ █
335 ns Histogram: log(frequency) by time 376 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark @. $y = Base.FastMath.sqrt_fast($x)
BenchmarkTools.Trial: 10000 samples with 755 evaluations.
Range (min … max): 167.358 ns … 432.384 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 167.364 ns ┊ GC (median): 0.00%
Time (mean ± σ): 168.638 ns ± 5.218 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▄▂ ▂ ▁
█▆▁███▇▇▆▅█▆▆▆▆▆▅▆▆▄▅▅█▇▆▅▅▆▅▅▅▅▅▅▅▆▆▆▅▅▆▅▆▄▅▅▅▄▆▅▅▅▅▅▅▅▅▅▅█▆ █
167 ns Histogram: log(frequency) by time 188 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark @. $y = NaNMath.sqrt($x)
BenchmarkTools.Trial: 10000 samples with 755 evaluations.
Range (min … max): 167.358 ns … 526.793 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 167.366 ns ┊ GC (median): 0.00%
Time (mean ± σ): 168.457 ns ± 5.773 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▄▂ ▁ ▁
█▆▁████▇▆▅█▆▆▅▆▆▅▆▅▅▆▅█▇▅▅▅▅▅▅▅▅▆▄▆▄▅▅▅▅▅▅▅▅▄▄▄▄▅▅▄▃▄▅▄▅▆▆▅█▇ █
167 ns Histogram: log(frequency) by time 188 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
```

This was done on an Intel CPU, and to date, Intel CPUs can only do 128 bits of `sqrt` per clock cycle, so SIMD only gives us a 2x speedup. That is, using YMM (256-bit) registers takes twice as long as using XMM (128-bit) registers, while operating on only a single element of an XMM register is the same speed as operating on two.

Note that this doesn’t hold for AMD Ryzen. On Zen 2 and Zen 3 in particular, `sqrt` takes the same time regardless of whether it operates on 1, 2, or 4 elements, meaning the vectorized form should be 4x (rather than 2x) faster. They are much slower than Intel for 1 or 2 elements, however.

So in that case, I’d suggest going with `NaNMath`; it avoids throwing an error, and it is the *possibility* of throwing an error that is expensive, not the check. The check is essentially free compared to the expensive `sqrt` itself, and the check can be vectorized.
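To make that concrete, here is a minimal sketch (my own loop, not `NaNMath`'s actual implementation) of what a vectorizable checked `sqrt` kernel looks like. The branch compiles to a select/blend rather than an error path, so LLVM is free to emit packed `sqrt` instructions; the function name `nan_sqrt!` is an assumption for illustration.

```julia
# Sketch: a checked sqrt that never throws, so the loop can be SIMD-vectorized.
# Negative lanes get NaN instead of a DomainError, mirroring NaNMath's semantics.
function nan_sqrt!(y::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd for i in eachindex(y, x)
        v = x[i]
        # The ternary keeps sqrt off negative inputs; LLVM can lower this
        # to a comparison mask plus a blended vector sqrt.
        y[i] = v < 0.0 ? NaN : sqrt(v)
    end
    return y
end
```

Compare this with a loop calling `Base.sqrt` directly: the potential `DomainError` throw forces scalar code, which is exactly the difference the benchmarks above show.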

Note that because it is implemented as

```julia
sqrt(x::Real) = x < 0.0 ? NaN : Base.sqrt(x)
```

it’ll return a `Float64` unconditionally when `x < 0.0`, so `NaNMath.sqrt(4f0)` will be type unstable.
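A self-contained way to see the instability, using a replica of the definition above (`nm_sqrt` is an assumed stand-in name, so this runs without the `NaNMath` package): the `NaN` literal is a `Float64`, so the two branches return different types for `Float32` input.

```julia
# Replica of NaNMath's definition, to inspect return types without the package.
nm_sqrt(x::Real) = x < 0.0 ? NaN : Base.sqrt(x)

typeof(nm_sqrt(4f0))   # Float32: takes the Base.sqrt branch
typeof(nm_sqrt(-4f0))  # Float64: returns the Float64 NaN literal
```

A type-stable variant would return `oftype(float(x), NaN)` in the negative branch instead of the bare `NaN` literal.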

ForwardDiff overloads it for `ForwardDiff.Dual`, though, so it’ll be type stable with `Dual`s, which is the primary special number type I care about.