I recently looked a little at LLVM intrinsics; just FYI (especially @ChrisRackauckas, for mixed-precision math in other projects): you can get a big speedup by using
```julia
rsqrtss(x::Float32) =
    Base.llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone",
        "%v = insertelement <4 x float> undef, float %0, i32 0
         %resv = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
         %res = extractelement <4 x float> %resv, i32 0
         ret float %res"), Float32, Tuple{Float32}, x)
```
This computes a very fast approximate reciprocal square root, good to roughly 12–15 bits depending on the hardware (a single instruction on x86, and it inlines). The LLVM declaration for this intrinsic is kinda weird (it operates on a full `<4 x float>` vector even though only one lane is used), but LLVM is smart enough to use only one xmm register and not invalidate the others. Still, the resulting native code makes somewhat strange choices in register allocation (an unnecessary `movaps`); apparently there are not a lot of people using this intrinsic.
Using this instead of an explicit division and square root gives a speedup of roughly 2x vs. double precision and 1.5x vs. single precision on my machine (at the cost of abysmal precision).
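If the low precision is a problem, the standard trick is to follow the hardware estimate with one Newton–Raphson step, which roughly doubles the number of correct bits for a couple of extra multiplies. A minimal sketch (the helper names `nr_step` and `rsqrt_refined` are mine, not from the snippet above):

```julia
# One Newton–Raphson step for y ≈ 1/sqrt(x):
#   y1 = y0 * (1.5 - 0.5 * x * y0^2)
# Each step roughly doubles the number of correct bits,
# so ~12 bits from rsqrtss becomes close to full single precision.
nr_step(x::Float32, y0::Float32) = y0 * (1.5f0 - 0.5f0 * x * y0 * y0)

# Refine the hardware estimate (assumes `rsqrtss` from the snippet above
# is in scope on an x86 machine):
rsqrt_refined(x::Float32) = nr_step(x, rsqrtss(x))
```

This stays branch-free and cheap, so it usually still beats `1/sqrt(x)` while being accurate enough for most single-precision work; a second `nr_step` would be needed if feeding the result into double-precision math.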