I recently looked a little at LLVM intrinsics; just FYI (especially @ChrisRackauckas, for mixed-precision math in other projects): you can get a big speedup by using
```julia
rsqrtss(x::Float32) =
    Base.llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone",
        "%v = insertelement <4 x float> undef, float %0, i32 0
         %resv = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
         %res = extractelement <4 x float> %resv, i32 0
         ret float %res"), Float32, Tuple{Float32}, x)
```
This computes a very fast approximate reciprocal square root, good to roughly 12–15 bits depending on the hardware (a single instruction on x86, and it inlines). The LLVM declaration for this intrinsic is kinda weird (it operates on a full `<4 x float>` vector even though only one lane is used), but LLVM is smart enough to use only one xmm register and not invalidate the others. Still, the resulting native code makes somewhat strange choices in register allocation (an unnecessary `movaps`); apparently there are not a lot of people using this intrinsic.
Using this instead of an explicit division and square root gives a speedup of roughly 2x vs. double precision and 1.5x vs. single precision on my machine (at the cost of abysmal precision).
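If the low precision is a problem, the standard trick is to follow the hardware estimate with one Newton–Raphson step, which roughly doubles the number of correct bits for a couple of extra multiplies. A minimal sketch (the helper names `nr_step` and `rsqrt_refined` are mine, not from the snippet above):

```julia
# One Newton–Raphson step for y ≈ 1/sqrt(x):
#   y1 = y0 * (1.5 - 0.5 * x * y0^2)
# Each step roughly doubles the number of correct bits,
# so ~12 bits from rsqrtss becomes close to full single precision.
nr_step(x::Float32, y0::Float32) = y0 * (1.5f0 - 0.5f0 * x * y0 * y0)

# Refine the hardware estimate (assumes `rsqrtss` from the snippet above
# is in scope on an x86 machine):
rsqrt_refined(x::Float32) = nr_step(x, rsqrtss(x))
```

This stays branch-free and cheap, so it usually still beats `1/sqrt(x)` while being accurate enough for most single-precision work; a second `nr_step` would be needed if feeding the result into double-precision math.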