The following two functions both evaluate Chebyshev polynomials, using slightly different recurrences that nevertheless require nearly the same number of operations, both are type-stable with 0 allocations, and yet one of the two is 20–30% faster than the other. (In fact, the loop that requires more floating-point operations—one more multiply per iteration—is faster!) Why?
function cheb2(a, ξ) N = length(a) Tₖ₋₂ = one(ξ) Tₖ₋₁ = ξ @inbounds y = Tₖ₋₂*a + Tₖ₋₁*a for k = 3:N Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂ @inbounds y += Tₖ*a[k] Tₖ₋₂ = Tₖ₋₁ Tₖ₋₁ = Tₖ end return y end function cheb1r(c, xd) n = length(c) @inbounds bₖ = c + 2xd*c @inbounds bₖ₊₁ = oftype(bₖ, c) for j = 3:n-1 @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁ bₖ₊₁ = bₖ bₖ = bⱼ end @inbounds return c[n] + xd*bₖ - bₖ₊₁ end
n=200 (so that only the inner loops should matter?), we get equivalent results from both functions but very different timings:
julia> c = rand(200); rc = reverse(c); julia> using BenchmarkTools julia> @btime cheb2($c, 0.3) 490.411 ns (0 allocations: 0 bytes) -1.674025847985696 julia> @btime cheb1r($rc, 0.3) 639.731 ns (0 allocations: 0 bytes) -1.6740258479856949