Avoid negative Values

Are there any optimizations to be had for sqrt (or other functions) if you write a version that only has to work on the range [0, 1]? I Googled a bit to see what exists, both in general and for Julia.

At the least, you can cut the run time by about 33% (roughly 1.5x faster) by going through Float32:

julia> @btime @fastmath Float64(sqrt(Float32($x)))
  1.885 ns (0 allocations: 0 bytes)
0.7071067690849304

vs.

julia> @btime sqrt($x);
  2.805 ns (0 allocations: 0 bytes)

julia> @btime $x^0.5;
  3.374 ns (0 allocations: 0 bytes)

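By now, @fastmath scares me a bit (and certainly as a global setting), but it seemed OK for the one value I tried, x = 0.5. Note that the result is only accurate to Float32 precision: 0.7071067690849304 vs. 0.7071067811865476 for the Float64 sqrt(0.5).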
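Since one value is thin evidence, here is a minimal sketch of how the accuracy could be checked over the whole range [0, 1]. The names `fast_sqrt32` and `max_rel_error` are made up for this example, and the grid of 100,001 points is an arbitrary choice, so this is a sanity check rather than a proof.

```julia
# Hypothetical helper: the same expression that was benchmarked above.
fast_sqrt32(x::Float64) = @fastmath Float64(sqrt(Float32(x)))

# Largest relative error of f against the Float64 sqrt over the points in xs.
function max_rel_error(f, xs)
    worst = 0.0
    for x in xs
        ref = sqrt(x)
        ref == 0 && continue          # skip x == 0 to avoid dividing by zero
        worst = max(worst, abs(f(x) - ref) / ref)
    end
    return worst
end

xs = range(0.0, 1.0; length = 100_001)  # arbitrary sample grid on [0, 1]
err = max_rel_error(fast_sqrt32, xs)
println("max relative error on [0, 1]: ", err)
```

An error on the order of eps(Float32) would be the expected price of the Float32 round trip; anything much larger would point at @fastmath doing something more aggressive on that machine.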

I found a 2021 paper on square roots but it may apply less to x86:
https://www.mdpi.com/2079-3197/9/2/21/pdf

Our experimental results show that the proposed algorithms provide a fairly good trade-off between accuracy and latency after two iterations for numbers of type float, and after three iterations for numbers of type double when using fused multiply–add instructions—giving almost complete accuracy. […]
Polynomial methods of high order rely heavily on multiplications and need to store polynomial coefficients in memory; they also require a range reduction step

If the input is known to lie in [0, 1], could the range reduction step be skipped? And could the polynomial be simplified?
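To make the question concrete, here is a sketch of what a range reduction step for sqrt typically looks like. The function name `sqrt_via_range_reduction` is invented for this example, and the hardware `sqrt` on the reduced mantissa only stands in for whatever polynomial or iteration would really be used there.

```julia
# Sketch of range reduction: write x = m * 2^e with e even, so that
# sqrt(x) = sqrt(m) * 2^(e/2) and only sqrt on [0.5, 2) is ever needed.
function sqrt_via_range_reduction(x::Float64)
    x == 0 && return 0.0
    m, e = frexp(x)            # x == m * 2^e with 0.5 <= m < 1
    if isodd(e)                # make the exponent even so it halves exactly
        m *= 2                 # now 1 <= m < 2
        e -= 1
    end
    core = sqrt(m)             # placeholder for a polynomial on [0.5, 2)
    return ldexp(core, e ÷ 2)  # scale back by 2^(e/2)
end

println(sqrt_via_range_reduction(0.25))
```

One thing this makes visible: inputs in [0, 1] still cover every negative exponent, so restricting to [0, 1] only removes the positive exponents. Skipping the step entirely looks plausible only if the inputs are also bounded away from zero.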

This, on page 13, seems relevant (as do Algorithms 6 and 11):

Algorithm 5. Proposed Sqrt31f algorithm (DC initial approximation)

As shown in Table 1, the proposed algorithms give significantly better performance than the library functions on the Raspberry Pi, from 3.17 to 3.62 times faster, and for SP numbers on ESP-32, 2.34 times faster for the reciprocal square root and approximately 1.78 times faster than the sqrtf(x) function.
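For a feel of this family of methods, here is a sketch of the textbook version: a bit-trick initial guess for 1/sqrt(x) followed by Newton-Raphson steps written with fma. This is not the paper's Algorithm 5; it uses the widely known single magic constant 0x5f3759df rather than the paper's initial approximation, and the name `sqrt_newton` and its `iterations` keyword are made up here.

```julia
# Textbook sketch, NOT the paper's algorithm: approximate sqrt(x) as
# x * (1/sqrt(x)), refining a magic-constant guess with Newton-Raphson.
function sqrt_newton(x::Float32; iterations::Int = 2)
    x == 0f0 && return 0f0
    i = reinterpret(UInt32, x)
    y = reinterpret(Float32, 0x5f3759df - (i >> 1))  # initial guess for 1/sqrt(x)
    for _ in 1:iterations
        h = 0.5f0 * x * y
        y = y * fma(-h, y, 1.5f0)   # y <- y * (1.5 - 0.5 * x * y^2)
    end
    return x * y                    # sqrt(x) == x * (1/sqrt(x))
end

println(sqrt_newton(0.5f0), " vs. ", sqrt(0.5f0))
```

Whether something like this beats the hardware sqrt instruction on x86 is a separate question; the paper's numbers are for the Raspberry Pi and ESP-32, so it would need benchmarking with @btime before drawing any conclusion.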