The following two functions both evaluate Chebyshev polynomial series, using slightly different recurrences that nevertheless require nearly the same number of operations. Both are type-stable with zero allocations, and yet one of them is 20–30% faster than the other. (In fact, the loop that requires *more* floating-point operations, one more multiply per iteration, is the *faster* one!) Why?
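
For reference, written out as math (these are the standard forms, with the series coefficients as `a[1..N]` in `cheb2` and passed in reverse order as `c` to `cheb1r`): `cheb2` sums the series directly via the three-term recurrence, while `cheb1r` is Clenshaw's downward recurrence:

$$T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x), \qquad y = \sum_{k=1}^{N} a_k\,T_{k-1}(x),$$

$$b_k = a_k + 2x\,b_{k+1} - b_{k+2}, \qquad b_N = b_{N+1} = 0, \qquad y = a_0 + x\,b_1 - b_2,$$

where the second form uses 0-based coefficients $a_0, \dots, a_{N-1}$.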

```
function cheb2(a, ξ)
    # direct three-term recurrence Tₖ = 2ξTₖ₋₁ - Tₖ₋₂,
    # accumulating y = Σₖ a[k]·Tₖ₋₁(ξ) as we go
    N = length(a)
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    for k = 3:N
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        @inbounds y += Tₖ*a[k]
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

function cheb1r(c, xd)
    # Clenshaw recurrence, with the coefficients passed in reverse order
    # (c[1] is the highest-order coefficient)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```
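
As a sanity check that the two loops really do compute the same series, both can be compared against the closed form Tₖ(x) = cos(k·acos(x)), valid for |x| ≤ 1. A minimal standalone check (with copies of the two routines, minus `@inbounds`, so the snippet runs on its own; `chebref` is just a slow reference evaluator, not part of the package):

```
function cheb2(a, ξ)
    N = length(a)
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    for k = 3:N
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        y += Tₖ*a[k]
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

function cheb1r(c, xd)
    n = length(c)
    bₖ = c[2] + 2xd*c[1]
    bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    return c[n] + xd*bₖ - bₖ₊₁
end

# Independent reference via the closed form Tₖ(x) = cos(k·acos(x)):
chebref(a, x) = sum(a[k] * cos((k-1) * acos(x)) for k in eachindex(a))

c = rand(200)
@assert cheb2(c, 0.3) ≈ chebref(c, 0.3)
@assert cheb1r(reverse(c), 0.3) ≈ chebref(c, 0.3)
```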

For `n=200` (so that only the inner loops should matter?), we get equivalent results from both functions but very different timings:

```
julia> c = rand(200); rc = reverse(c);
julia> using BenchmarkTools
julia> @btime cheb2($c, 0.3)
490.411 ns (0 allocations: 0 bytes)
-1.674025847985696
julia> @btime cheb1r($rc, 0.3)
639.731 ns (0 allocations: 0 bytes)
-1.6740258479856949
```

Any ideas? (This is from FastChebInterp#5, cc @markmbaum.)