Two nearly identical loops — why is one faster?

stevengj · September 10, 2021, 6:31pm

The following two functions both evaluate Chebyshev polynomials, using slightly different recurrences that nevertheless require nearly the same number of operations, both are type-stable with 0 allocations, and yet one of the two is 20–30% faster than the other. (In fact, the loop that requires more floating-point operations—one more multiply per iteration—is faster!) Why?

function cheb2(a, ξ)
    N = length(a)
    Tₖ₋₂ = one(ξ)
    Tₖ₋₁ = ξ
    @inbounds y = Tₖ₋₂*a[1] + Tₖ₋₁*a[2]
    for k = 3:N
        Tₖ = 2*ξ*Tₖ₋₁ - Tₖ₋₂
        @inbounds y += Tₖ*a[k]
        Tₖ₋₂ = Tₖ₋₁
        Tₖ₋₁ = Tₖ
    end
    return y
end

function cheb1r(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = c[j] + 2xd*bₖ - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end

For n=200 (so that only the inner loops should matter?), we get equivalent results from both functions but very different timings:

julia> c = rand(200); rc = reverse(c);

julia> using BenchmarkTools

julia> @btime cheb2($c, 0.3)
  490.411 ns (0 allocations: 0 bytes)
-1.674025847985696

julia> @btime cheb1r($rc, 0.3)
  639.731 ns (0 allocations: 0 bytes)
-1.6740258479856949

Any ideas? (This is from FastChebInterp#5, cc @markmbaum.)

stevengj · September 10, 2021, 6:44pm

Aha, I found the issue. In cheb1r, the compiler apparently wasn’t finding a fused multiply-add optimization because of the order in which the operations had been written. Rewriting it as:

function cheb1r_muladd(c, xd)
    n = length(c)
    @inbounds bₖ = c[2] + 2xd*c[1]
    @inbounds bₖ₊₁ = oftype(bₖ, c[1])
    for j = 3:n-1
        @inbounds bⱼ = muladd(2xd, bₖ, c[j]) - bₖ₊₁
        bₖ₊₁ = bₖ
        bₖ = bⱼ
    end
    @inbounds return c[n] + xd*bₖ - bₖ₊₁
end

to explicitly identify the multiply-add operation fixes the performance, and makes cheb1r_muladd the fastest of the 3:

julia> @btime cheb1r_muladd($rc, 0.3)
  392.990 ns (0 allocations: 0 bytes)
-1.6740258479856975

stevengj · September 10, 2021, 7:03pm

In fact, just using @fastmath is sufficient.

Topic		Replies	Views
Performance issues - Comparison with Matlab Performance question	6	955	June 2, 2020
Improving Computational Performance Performance fast-math	6	942	June 20, 2021
Arithmetic performance of expression Performance	11	371	October 4, 2022
Why are calls to this function slow, compared to Python+Numpy? General Usage fast-math	14	2115	August 13, 2017
@fastmath switched off in Julia 0.7-Dev, deliberately? Performance fast-math	7	1747	June 24, 2018

Two nearly identical loops — why is one faster?

Related topics