@btime of cis Function slower than writing it again

I found a quite strange effect when evaluating the performance of cis via @btime:

using BenchmarkTools

sz = (2000,2000)
a = Float32.(2pi .* rand(sz...));
# cis(phi) via sincos: sin(pi/2 - phi) = cos(phi) and cos(pi/2 - phi) = sin(phi)
function cis_fast(phi::T) where {T<:Real}
    complex(sincos(T(pi/2) .- phi)...)
end
r = zeros(ComplexF32, sz);
@btime $r .= cis_fast.($a); # 42 ms
@btime $r .= cis.($a); # 46 ms

With the function as written, I consistently get faster results than with the built-in version. I find this odd, since the built-in version looks like it requires fewer calculations. Tested in Julia 1.7.1 and 1.8.0.

It looks like it has to do with argument reduction.

The sincos function, along with other trig functions, is fastest for arguments in [-π/4, π/4], and otherwise has to reduce the argument modulo π/2 to that range. For your arguments φ distributed uniformly in [0, 2π), computing π/2 - φ increases the probability of the argument being in [-π/4, π/4] from 1/8 to 1/4, and hence speeds it up on average.

If you do a = Float32.((pi/2) .* rand(sz...)), then π/2 - φ does not change the distribution of magnitudes and hence the two functions become about equally fast on my machine.
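A quick way to check the distribution argument without timing anything is to count how many arguments fall in [-π/4, π/4] before and after the shift. This is only a sketch: the helper name frac_small, the fixed seed, and the sample size are made up for illustration, it uses Float64 rather than Float32 for simplicity, and it takes the π/4 threshold from the explanation above rather than from Julia's source.

```julia
using Random

# Fraction of values with magnitude at most π/4, the range described above
# as the fast path (hypothetical helper, not part of Base).
frac_small(x) = count(v -> abs(v) <= pi/4, x) / length(x)

rng = MersenneTwister(1)   # fixed seed, an arbitrary choice for reproducibility
n = 10^6

phi_full = 2pi .* rand(rng, n)          # uniform in [0, 2π), as in the question
phi_quarter = (pi/2) .* rand(rng, n)    # uniform in [0, π/2), the control case

f_full_direct = frac_small(phi_full)                  # theory: 1/8
f_full_shifted = frac_small(pi/2 .- phi_full)         # theory: 1/4
f_quarter_direct = frac_small(phi_quarter)            # theory: 1/2
f_quarter_shifted = frac_small(pi/2 .- phi_quarter)   # theory: 1/2

println((f_full_direct, f_full_shifted, f_quarter_direct, f_quarter_shifted))
```

If the measured fractions match the theoretical ones, the shift doubles the share of fast-path arguments for inputs in [0, 2π) and leaves it unchanged for inputs in [0, π/2), which is consistent with the timings reported above. It does not by itself show how much each reduced argument costs.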


Fantastic. I did not think of this, but it explains it. Good to know that that range is optimal.