# Unboxing in iteration

#1

I am optimizing the following simple function:

``````julia
"""
Calculate ``∑ₖ cₖ fₖ, k=1,...,n`` where ``fₖ₊₁ = α fₖ + β fₖ₋₁`` and
the first two elements are given. Uses the reverse Clenshaw algorithm,
stability is assumed.
"""
function clenshaw{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
    bnn = bn = zero(T)
    for k in length(c):-1:2
        bn, bnn = c[k] + α*bn + β*bnn, bn
    end
    f1*(c[1] + β*bnn) + f2*bn
end
``````

and I think my problem is what happens to `SSAValue(6)` below, but I don’t know what I can do about it:

``````julia
julia> @code_warntype clenshaw(rand(10), 1.0, -1.0, 1.0, 0.5)
Variables:
  #self#::#clenshaw
  c::Array{Float64,1}
  α::Float64
  β::Float64
  f1::Float64
  f2::Float64
  bn::Float64
  bnn::Float64
  #temp#::Int64
  k::Int64

Body:
  begin
      SSAValue(0) = (Base.box)(Float64,(Base.sitofp)(Float64,0))
      bn::Float64 = SSAValue(0)
      bnn::Float64 = SSAValue(0) # line 8:
      SSAValue(5) = (Base.arraylen)(c::Array{Float64,1})::Int64
      # meta: location range.jl colon 115
      # meta: location range.jl Type 70
      # meta: location range.jl Type 18
      SSAValue(6) = $(Expr(:invoke, LambdaInfo for steprange_last(::Int64, ::Int64, ::Int64), :(Base.steprange_last), SSAValue(5), -1, 2))
      # meta: pop location
      # meta: pop location
      # meta: pop location
      #temp#::Int64 = SSAValue(5)
      14:
      unless (Base.box)(Base.Bool,(Base.not_int)((Base.box)(Base.Bool,(Base.or_int)((Base.box)(Base.Bool,(Base.and_int)((Base.box)(Base.Bool,(Base.not_int)((SSAValue(5) === SSAValue(6))::Bool)),(Base.box)(Base.Bool,(Base.not_int)(((Base.slt_int)(0,-1)::Bool === (Base.slt_int)(SSAValue(5),SSAValue(6))::Bool)::Bool)))),(#temp#::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(6),-1)))::Bool)))) goto 27
      SSAValue(7) = #temp#::Int64
      k::Int64 = SSAValue(7)
      #temp#::Int64 = SSAValue(8) # line 9:
      SSAValue(4) = bn::Float64
      bn::Float64 = SSAValue(3)
      bnn::Float64 = SSAValue(4)
      25:
      goto 14
      27:  # line 11:
  end::Float64
``````

I would like to speed this up (it is called a lot in an inner loop of my code), but at the same time I would also like to understand what is going on with `SSAValue(6)`.

#2

FYI optimised clenshaw code:

#3

Thanks, and actually I am aware of that (I have been reading the `ApproxFun` codebase), but in this case my problem is more about learning how to deal with this than about the actual function. `clenshaw` was just an MWE.

#4

What makes you think there is a problem? It runs in 30 ns with 0 allocations. The `steprange_last` is just a function call that is not inlined.
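To make `SSAValue(6)` concrete: constructing the `StepRange` `length(c):-1:2` invokes `Base.steprange_last` once to precompute the last element actually reachable from the start with the given step, so that the loop can terminate with a cheap equality test. A small illustration (`steprange_last` is an internal function, so the details are version-dependent):

``````julia
r = 10:-1:2
# A StepRange stores first, step and last; the last element is computed
# up front from (start, step, stop) when the range is constructed.
(first(r), step(r), last(r))          # (10, -1, 2)
Base.steprange_last(10, -1, 2) == 2   # true
``````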

#5

I am not an expert, but I read that Clenshaw is better for Chebyshev evaluation. However, the naive Chebyshev evaluator below is about 2x as fast as `clenshaw`, so I thought there was a problem. MWE:

``````julia
"""
Calculate ``∑ₖ cₖ fₖ, k=1,...,n`` where ``fₖ₊₁ = α fₖ + β fₖ₋₁`` and
the first two elements are given. Uses the reverse Clenshaw algorithm,
stability is assumed.
"""
function clenshaw{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
    bnn = bn = zero(T)
    for k in length(c):-1:2
        bn, bnn = c[k] + α*bn + β*bnn, bn
    end
    f1*(c[1] + β*bnn) + f2*bn
end

"""
Evaluate the `θ`-weighted sum of the first `n` Chebyshev polynomials
at `x`.
"""
function evaluate_chebyshev{T}(n, θ::AbstractVector{T}, x::T)
    xp = x
    xpp = one(T)
    s = zero(x)
    for i in 1:n
        if i == 1
            s += xpp*θ[i]
        elseif i == 2
            s += xp*θ[i]
        else
            xp, xpp = 2*x*xp - xpp, xp
            s += xp*θ[i]
        end
    end
    s
end

using BenchmarkTools
θ = rand(10)
clenshaw(θ, 1.0, -1.0, 1.0, 0.5) ≈ evaluate_chebyshev(10, θ, 0.5)

@benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5) # ≈37ns
@benchmark evaluate_chebyshev(10, $θ, 0.5)   # ≈18ns
``````

Am I benchmarking incorrectly?
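For reference, here is how I convinced myself the two agree for general `x` and not just by accident at `x = 0.5` (`chebyshev_sum` is a throwaway direct evaluator, written just for this check):

``````julia
# Direct reference: build T₀(x), T₁(x), ... by the three-term recurrence
# and accumulate θ[k] * T_{k-1}(x) explicitly.
function chebyshev_sum(θ, x)
    tpp, tp = one(x), x            # T₀(x), T₁(x)
    s = θ[1]*tpp + θ[2]*tp
    for k in 3:length(θ)
        tp, tpp = 2x*tp - tpp, tp
        s += θ[k]*tp
    end
    s
end

x = 0.5
# Chebyshev corresponds to α = 2x, β = -1, f1 = T₀(x) = 1, f2 = T₁(x) = x,
# which for x = 0.5 gives exactly the (1.0, -1.0, 1.0, 0.5) used above.
clenshaw(θ, 2x, -1.0, 1.0, x) ≈ chebyshev_sum(θ, x)  # true
``````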

#6

I don’t know whether you are benchmarking incorrectly, but the following gives you another 10 ns improvement:

``````julia
julia> function clenshaw2{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
           bnn = bn = zero(T)
           l = length(c)
           for f in Base.OneTo(l-1)
               k = l+1-f
               bn, bnn = c[k] + α*bn + β*bnn, bn
           end
           f1*(c[1] + β*bnn) + f2*bn
       end

julia> clenshaw(θ, 1.0, -1.0, 1.0, 0.5)
0.3977626513630642

julia> clenshaw2(θ, 1.0, -1.0, 1.0, 0.5)
0.3977626513630642

julia> @benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5)  # ≈29.4ns
julia> @benchmark clenshaw2($θ, 1.0, -1.0, 1.0, 0.5) # ≈19.8ns
``````

When you use `@code_native` on both versions, you can see that the second now produces nicely vectorised code.
Non-standard loop patterns seem to irritate the optimiser from time to time and prevent it from recognising vectorisation opportunities.
(Hopefully I haven’t accidentally changed the number of iterations…)
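One quick way to settle the iteration-count worry: the rewritten loop visits exactly the same indices, just generated forwards from a `Base.OneTo` counter:

``````julia
l = 10
original  = collect(l:-1:2)                   # indices used by clenshaw
rewritten = [l+1-f for f in Base.OneTo(l-1)]  # indices used by clenshaw2
original == rewritten   # true: both are [10, 9, ..., 2]
``````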

And if you can make your function parameters compile-time constants and add a few `@inbounds`, you gain another 6.5 ns:

``````julia
julia> function clenshaw3{T}(c::AbstractVector{T})
           α, β, f1, f2 = 1.0, -1.0, 1.0, 0.5
           bnn = bn = zero(T)
           l = length(c)
           for f in Base.OneTo(l-1)
               k = l+1-f
               @inbounds bn, bnn = c[k] + α*bn + β*bnn, bn
           end
           @inbounds r = f1*(c[1] + β*bnn) + f2*bn
           return r
       end

julia> @benchmark clenshaw3($θ) # ≈13.3ns
``````

Note that the first optimisation above (changing the loop) only has a relevant effect on very small arrays, as in your case with length 10. The second (constant factors) has a significant effect for large arrays, since the repeated multiplications are optimised away, in particular those by 1.0.
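If hard-coding all four parameters is too restrictive, a middle ground for the Chebyshev case is to specialise only on the structure α = 2x, β = -1, f1 = 1, f2 = x and keep `x` as an argument. A sketch (`clenshaw_cheb` is a made-up name, and I haven’t benchmarked this variant):

``````julia
# Clenshaw specialised to the Chebyshev recurrence: β = -1 turns one
# multiplication into a subtraction, and f1 = 1 drops another one in the
# final combination step.
function clenshaw_cheb{T}(c::AbstractVector{T}, x::T)
    bnn = bn = zero(T)
    l = length(c)
    for f in Base.OneTo(l-1)
        k = l+1-f
        @inbounds bn, bnn = c[k] + 2x*bn - bnn, bn
    end
    @inbounds (c[1] - bnn) + x*bn
end
``````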

``````julia
julia> const θ = rand(32)

julia> @benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5)  # ≈119ns
julia> @benchmark clenshaw2($θ, 1.0, -1.0, 1.0, 0.5) # ≈110ns
julia> @benchmark clenshaw3($θ)                      # ≈ 76ns
``````