# Unboxing in iteration

#1

I am optimizing the following simple function:

``````julia
"""
Calculate ``∑ₖ cₖ fₖ, k=1,...,n`` where ``fₖ₊₁ = α fₖ + β fₖ₋₁`` and
the first two elements are given. Uses the reverse Clenshaw algorithm,
stability is assumed.
"""
function clenshaw{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
    bnn = bn = zero(T)
    for k in length(c):-1:2
        bn, bnn = c[k] + α*bn + β*bnn, bn
    end
    f1*(c[1] + β*bnn) + f2*bn
end
``````

and I think my problem is what happens to `SSAValue(6)` below, but I don’t know what I can do about it:

``````julia
julia> @code_warntype clenshaw(rand(10), 1.0, -1.0, 1.0, 0.5)
Variables:
  #self#::#clenshaw
  c::Array{Float64,1}
  α::Float64
  β::Float64
  f1::Float64
  f2::Float64
  bn::Float64
  bnn::Float64
  #temp#::Int64
  k::Int64

Body:
  begin
      SSAValue(0) = (Base.box)(Float64,(Base.sitofp)(Float64,0))
      bn::Float64 = SSAValue(0)
      bnn::Float64 = SSAValue(0) # line 8:
      SSAValue(5) = (Base.arraylen)(c::Array{Float64,1})::Int64
      # meta: location range.jl colon 115
      # meta: location range.jl Type 70
      # meta: location range.jl Type 18
      SSAValue(6) = $(Expr(:invoke, LambdaInfo for steprange_last(::Int64, ::Int64, ::Int64), :(Base.steprange_last), SSAValue(5), -1, 2))
      # meta: pop location
      # meta: pop location
      # meta: pop location
      #temp#::Int64 = SSAValue(5)
      14:
      unless (Base.box)(Base.Bool,(Base.not_int)((Base.box)(Base.Bool,(Base.or_int)((Base.box)(Base.Bool,(Base.and_int)((Base.box)(Base.Bool,(Base.not_int)((SSAValue(5) === SSAValue(6))::Bool)),(Base.box)(Base.Bool,(Base.not_int)(((Base.slt_int)(0,-1)::Bool === (Base.slt_int)(SSAValue(5),SSAValue(6))::Bool)::Bool)))),(#temp#::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(6),-1)))::Bool)))) goto 27
      SSAValue(7) = #temp#::Int64
      k::Int64 = SSAValue(7)
      #temp#::Int64 = SSAValue(8) # line 9:
      SSAValue(4) = bn::Float64
      bn::Float64 = SSAValue(3)
      bnn::Float64 = SSAValue(4)
      25:
      goto 14
      27:  # line 11:
  end::Float64
``````

I would like to speed this up (it is called a lot in an inner loop of my code), but at the same time I would also like to understand what is going on with `SSAValue(6)`.

#2

FYI optimised clenshaw code:

#3

Thanks, and actually I am aware of that (I have been reading the `ApproxFun` codebase), but in this case my problem is more about learning how to deal with this than about the actual function. `clenshaw` was just an MWE.

#4

What makes you think there is a problem? It runs in 30 ns with 0 allocations. The `steprange_last` is just a function call that is not inlined.
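To make `SSAValue(6)` concrete: constructing the `StepRange` `length(c):-1:2` invokes `Base.steprange_last` once to precompute the last element actually reachable from the start with the given step, so that the loop can terminate with a cheap equality test. A small illustration (`steprange_last` is an internal function, so the details are version-dependent):

``````julia
r = 10:-1:2
# A StepRange stores first, step and last; the last element is computed
# up front from (start, step, stop) when the range is constructed.
(first(r), step(r), last(r))          # (10, -1, 2)
Base.steprange_last(10, -1, 2) == 2   # true
``````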

#5

I am not an expert, but I read that Clenshaw is better for Chebyshev evaluation. However, the naive Chebyshev evaluator below is about 2x as fast as `clenshaw`, so I thought there was a problem. MWE:

``````julia
"""
Calculate ``∑ₖ cₖ fₖ, k=1,...,n`` where ``fₖ₊₁ = α fₖ + β fₖ₋₁`` and
the first two elements are given. Uses the reverse Clenshaw algorithm,
stability is assumed.
"""
function clenshaw{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
    bnn = bn = zero(T)
    for k in length(c):-1:2
        bn, bnn = c[k] + α*bn + β*bnn, bn
    end
    f1*(c[1] + β*bnn) + f2*bn
end

"""
Evaluate the `θ`-weighted sum of the first `n` Chebyshev polynomials
at `x`.
"""
function evaluate_chebyshev{T}(n, θ::AbstractVector{T}, x::T)
    xp = x
    xpp = one(T)
    s = zero(x)
    for i in 1:n
        if i == 1
            s += xpp*θ[i]
        elseif i == 2
            s += xp*θ[i]
        else
            xp, xpp = 2*x*xp - xpp, xp
            s += xp*θ[i]
        end
    end
    s
end

using BenchmarkTools
θ = rand(10)
clenshaw(θ, 1.0, -1.0, 1.0, 0.5) ≈ evaluate_chebyshev(10, θ, 0.5)

@benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5) # ≈37ns
@benchmark evaluate_chebyshev(10, $θ, 0.5)   # ≈18ns
``````

Am I benchmarking incorrectly?
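For reference, here is how I convinced myself the two agree for general `x` and not just by accident at `x = 0.5` (`chebyshev_sum` is a throwaway direct evaluator, written just for this check):

``````julia
# Direct reference: build T₀(x), T₁(x), ... by the three-term recurrence
# and accumulate θ[k] * T_{k-1}(x) explicitly.
function chebyshev_sum(θ, x)
    tpp, tp = one(x), x            # T₀(x), T₁(x)
    s = θ[1]*tpp + θ[2]*tp
    for k in 3:length(θ)
        tp, tpp = 2x*tp - tpp, tp
        s += θ[k]*tp
    end
    s
end

x = 0.5
# Chebyshev corresponds to α = 2x, β = -1, f1 = T₀(x) = 1, f2 = T₁(x) = x,
# which for x = 0.5 gives exactly the (1.0, -1.0, 1.0, 0.5) used above.
clenshaw(θ, 2x, -1.0, 1.0, x) ≈ chebyshev_sum(θ, x)  # true
``````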

#6

I don’t know whether you are benchmarking incorrectly, but the following gives you another 10 ns improvement:

``````julia
julia> function clenshaw2{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
           bnn = bn = zero(T)
           l = length(c)
           for f in Base.OneTo(l-1)
               k = l+1-f
               bn, bnn = c[k] + α*bn + β*bnn, bn
           end
           f1*(c[1] + β*bnn) + f2*bn
       end

julia> clenshaw(θ, 1.0, -1.0, 1.0, 0.5)
0.3977626513630642

julia> clenshaw2(θ, 1.0, -1.0, 1.0, 0.5)
0.3977626513630642

julia> @benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5)  # ≈29.4ns
julia> @benchmark clenshaw2($θ, 1.0, -1.0, 1.0, 0.5) # ≈19.8ns
``````

When you use `@code_native` on both versions, you can see that the second now produces nicely vectorised code.
Non-standard loop patterns seem to irritate the optimiser from time to time and prevent it from recognising vectorisation opportunities.
(Hopefully I haven’t accidentally changed the number of iterations…)
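One quick way to settle the iteration-count worry: the rewritten loop visits exactly the same indices, just generated forwards from a `Base.OneTo` counter:

``````julia
l = 10
original  = collect(l:-1:2)                   # indices used by clenshaw
rewritten = [l+1-f for f in Base.OneTo(l-1)]  # indices used by clenshaw2
original == rewritten   # true: both are [10, 9, ..., 2]
``````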

And if you can make your function parameters compile-time constants and add a few `@inbounds`, you gain another 6.5 ns:

``````julia
julia> function clenshaw3{T}(c::AbstractVector{T})
           α, β, f1, f2 = 1.0, -1.0, 1.0, 0.5
           bnn = bn = zero(T)
           l = length(c)
           for f in Base.OneTo(l-1)
               k = l+1-f
               @inbounds bn, bnn = c[k] + α*bn + β*bnn, bn
           end
           @inbounds r = f1*(c[1] + β*bnn) + f2*bn
           return r
       end

julia> @benchmark clenshaw3($θ) # ≈13.3ns
``````

Note that the first optimisation above (changing the loop) only has a relevant effect on very small arrays, as in your case with length 10. The second (constant factors) has a significant effect for large arrays, since the repeated multiplications are optimised away, in particular those by 1.0.
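If hard-coding all four parameters is too restrictive, a middle ground for the Chebyshev case is to specialise only on the structure α = 2x, β = -1, f1 = 1, f2 = x and keep `x` as an argument. A sketch (`clenshaw_cheb` is a made-up name, and I haven’t benchmarked this variant):

``````julia
# Clenshaw specialised to the Chebyshev recurrence: β = -1 turns one
# multiplication into a subtraction, and f1 = 1 drops another one in the
# final combination step.
function clenshaw_cheb{T}(c::AbstractVector{T}, x::T)
    bnn = bn = zero(T)
    l = length(c)
    for f in Base.OneTo(l-1)
        k = l+1-f
        @inbounds bn, bnn = c[k] + 2x*bn - bnn, bn
    end
    @inbounds (c[1] - bnn) + x*bn
end
``````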

``````julia
julia> const θ = rand(32)

julia> @benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5)  # ≈119ns
julia> @benchmark clenshaw2($θ, 1.0, -1.0, 1.0, 0.5) # ≈110ns
julia> @benchmark clenshaw3($θ)                      # ≈ 76ns
``````