 # Two nearly identical loops — why is one faster?

The following two functions both evaluate Chebyshev polynomials, using slightly different recurrences that nevertheless require nearly the same number of operations. Both are type-stable with zero allocations, and yet one of the two is 20–30% faster than the other. (In fact, the loop that requires *more* floating-point operations, one extra multiply per iteration, is the faster one!) Why?

```julia
function cheb2(a, ξ)
    N = length(a)
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    for k = 3:N
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        @inbounds y += Tₖ*a[k]
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

function cheb1r(c, xd)  # Clenshaw recurrence on reversed coefficients: c = reverse(a)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

For `n=200` (so that only the inner loops should matter?), we get equivalent results from both functions but very different timings:

```julia
julia> c = rand(200); rc = reverse(c);

julia> using BenchmarkTools

julia> @btime cheb2($c, 0.3)
  490.411 ns (0 allocations: 0 bytes)
-1.674025847985696

julia> @btime cheb1r($rc, 0.3)
  639.731 ns (0 allocations: 0 bytes)
-1.6740258479856949
```
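As a cross-check on the "equivalent results" claim, both loops can be compared against the identity Tₖ(cos θ) = cos(kθ), which gives a slow but completely independent reference evaluation. The `cheb_exact` helper below is my own, not from the thread; `cheb1r` is copied from above:

```julia
# Reference evaluation using Tₖ(x) = cos(k·acos(x)), valid for |x| ≤ 1.
# Much slower and less stable than Clenshaw, but independent of it.
cheb_exact(a, x) = sum(a[k] * cos((k-1) * acos(x)) for k in eachindex(a))

# Clenshaw recurrence on reversed coefficients, as in the thread above.
function cheb1r(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end

a = rand(200)
abs(cheb1r(reverse(a), 0.3) - cheb_exact(a, 0.3)) < 1e-9  # agreement to roundoff
```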

Any ideas? (This is from FastChebInterp#5, cc @markmbaum.)


Aha, I found the issue. In `cheb1r`, the compiler apparently wasn’t finding a fused multiply-add optimization because of the order in which the operations had been written. Rewriting it as:

```julia
function cheb1r_muladd(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = muladd(2xd, bₖ, c[j]) - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

to identify the multiply-add operation explicitly fixes the performance and makes `cheb1r_muladd` the fastest of the three:

```julia
julia> @btime cheb1r_muladd($rc, 0.3)
  392.990 ns (0 allocations: 0 bytes)
-1.6740258479856975
```
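(For context: `muladd(x, y, z)` computes `x*y + z` while *permitting*, but not requiring, the compiler to contract it into a single fused multiply-add instruction; `fma` forces the fusion. A tiny illustration of the single-rounding behavior, independent of this thread:)

```julia
# 0.1 is not exactly representable; its double value is 0.1 + δ with δ ≈ 5.6e-18.
# With two roundings, the product 0.1*10.0 rounds to exactly 1.0, hiding δ:
0.1 * 10.0 - 1.0        # 0.0
# A fused multiply-add rounds only once, so the representation error survives:
fma(0.1, 10.0, -1.0)    # 5.551115123125783e-17 (= 2.0^-54)
```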

In fact, just using `@fastmath` is sufficient.
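One way to apply it (my sketch; the thread doesn't show the exact placement) is to annotate just the recurrence expression, which lets the compiler contract `c[j] + 2xd*bₖ` into an fma on its own:

```julia
function cheb1r_fastmath(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        # @fastmath relaxes IEEE semantics for this expression only, allowing
        # the compiler to fuse the multiply and add (and to reassociate).
        @inbounds @fastmath bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

Scoping the annotation to one expression avoids applying fast-math relaxations to the rest of the function.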
