 # Two nearly identical loops — why is one faster?

The following two functions both evaluate Chebyshev polynomials, using slightly different recurrences that nevertheless require nearly the same number of operations. Both are type-stable with zero allocations, and yet one of the two is 20–30% faster than the other. (In fact, the loop that requires *more* floating-point operations, one extra multiply per iteration, is the faster one!) Why?

```julia
function cheb2(a, ξ)
    N = length(a)
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    for k = 3:N
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        @inbounds y += Tₖ*a[k]
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

function cheb1r(c, xd)  # Clenshaw recurrence on reversed coefficients: c = reverse(a)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

For `n=200` (so that only the inner loops should matter?), we get equivalent results from both functions but very different timings:

```julia
julia> c = rand(200); rc = reverse(c);

julia> using BenchmarkTools

julia> @btime cheb2($c, 0.3)
  490.411 ns (0 allocations: 0 bytes)
-1.674025847985696

julia> @btime cheb1r($rc, 0.3)
  639.731 ns (0 allocations: 0 bytes)
-1.6740258479856949
```
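As a cross-check on the "equivalent results" claim, both loops can be compared against the identity Tₖ(cos θ) = cos(kθ), which gives a slow but completely independent reference evaluation. The `cheb_exact` helper below is my own, not from the thread; `cheb1r` is copied from above:

```julia
# Reference evaluation using Tₖ(x) = cos(k·acos(x)), valid for |x| ≤ 1.
# Much slower and less stable than Clenshaw, but independent of it.
cheb_exact(a, x) = sum(a[k] * cos((k-1) * acos(x)) for k in eachindex(a))

# Clenshaw recurrence on reversed coefficients, as in the thread above.
function cheb1r(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end

a = rand(200)
abs(cheb1r(reverse(a), 0.3) - cheb_exact(a, 0.3)) < 1e-9  # agreement to roundoff
```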

Any ideas? (This is from FastChebInterp#5, cc @markmbaum.)


Aha, I found the issue. In `cheb1r`, the compiler apparently wasn’t finding a fused multiply-add optimization because of the order in which the operations had been written. Rewriting it as:

```julia
function cheb1r_muladd(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = muladd(2xd, bₖ, c[j]) - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

to identify the multiply-add operation explicitly fixes the performance and makes `cheb1r_muladd` the fastest of the three:

```julia
julia> @btime cheb1r_muladd($rc, 0.3)
  392.990 ns (0 allocations: 0 bytes)
-1.6740258479856975
```
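(For context: `muladd(x, y, z)` computes `x*y + z` while *permitting*, but not requiring, the compiler to contract it into a single fused multiply-add instruction; `fma` forces the fusion. A tiny illustration of the single-rounding behavior, independent of this thread:)

```julia
# 0.1 is not exactly representable; its double value is 0.1 + δ with δ ≈ 5.6e-18.
# With two roundings, the product 0.1*10.0 rounds to exactly 1.0, hiding δ:
0.1 * 10.0 - 1.0        # 0.0
# A fused multiply-add rounds only once, so the representation error survives:
fma(0.1, 10.0, -1.0)    # 5.551115123125783e-17 (= 2.0^-54)
```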

In fact, just using `@fastmath` is sufficient.
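One way to apply it (my sketch; the thread doesn't show the exact placement) is to annotate just the recurrence expression, which lets the compiler contract `c[j] + 2xd*bₖ` into an fma on its own:

```julia
function cheb1r_fastmath(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        # @fastmath relaxes IEEE semantics for this expression only, allowing
        # the compiler to fuse the multiply and add (and to reassociate).
        @inbounds @fastmath bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end
```

Scoping the annotation to one expression avoids applying fast-math relaxations to the rest of the function.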
