# Two nearly identical loops — why is one faster?

The following two functions both evaluate Chebyshev polynomials, using slightly different recurrences that nevertheless require nearly the same number of operations. Both are type-stable with zero allocations, and yet one of the two is 20–30% faster than the other. (In fact, the loop that requires *more* floating-point operations, one extra multiply per iteration, is the faster one!) Why?
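For reference (these are standard identities, not spelled out in the original post), the two loops implement different recurrences for evaluating a Chebyshev series $p(x) = \sum_k c_k T_k(x)$:

```latex
% Direct three-term recurrence (used by cheb2):
T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)

% Clenshaw recurrence (used by cheb1r, with the coefficient array reversed):
b_{N+1} = b_{N+2} = 0, \quad b_k = c_k + 2x\,b_{k+1} - b_{k+2},
\qquad p(x) = c_0 + x\,b_1 - b_2
```

The direct recurrence accumulates `y += Tₖ*a[k]` each iteration (one extra multiply), while Clenshaw folds the coefficients into the recurrence itself.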

```julia
function cheb2(a, ξ)
    N = length(a)
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    for k = 3:N
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        @inbounds y += Tₖ*a[k]
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end
```

```julia
function cheb1r(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

For n = 200 (large enough that only the inner loops should matter), we get equivalent results from both functions but very different timings:

```julia
julia> c = rand(200); rc = reverse(c);

julia> using BenchmarkTools

julia> @btime cheb2($c, 0.3)
  490.411 ns (0 allocations: 0 bytes)
-1.674025847985696

julia> @btime cheb1r($rc, 0.3)
  639.731 ns (0 allocations: 0 bytes)
-1.6740258479856949
```

Any ideas? (This is from FastChebInterp#5, cc @markmbaum.)


Aha, I found the issue. In cheb1r, the compiler apparently wasn't finding a fused multiply-add (FMA) optimization because of the order in which the operations had been written. Rewriting it as:

```julia
function cheb1r_muladd(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = muladd(2xd, bₖ, c[j]) - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

to explicitly identify the multiply-add operation fixes the performance, and makes cheb1r_muladd the fastest of the 3:
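As an aside, a minimal sketch (not from the post) of the `muladd` semantics in play here: `muladd(a, b, c)` computes `a*b + c` while *permitting* the compiler to emit a single fused multiply-add instruction (one rounding step instead of two), whereas `fma(a, b, c)` *requires* fusion, falling back to a slower software emulation if the hardware lacks FMA.

```julia
# muladd(a, b, c) computes a*b + c, possibly fused into one FMA
# instruction; the result may differ from a*b + c in the last bit.
a, b, c = 0.1, 0.2, 0.3
@assert muladd(a, b, c) ≈ a*b + c
# muladd's result is always either the fused result or the unfused one:
@assert muladd(a, b, c) == fma(a, b, c) || muladd(a, b, c) == a*b + c
```

This freedom to fuse (or not) is exactly why writing the expression so the compiler can see the multiply-add pattern, or spelling it out with `muladd`, matters for inner-loop performance.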