# Unboxing in iteration

#1

I am optimizing the following simple function:

``````julia
"""
Calculate ``∑ₖ cₖ fₖ, k=1,...,n`` where ``fₖ₊₁ = α fₖ + β fₖ₋₁`` and
the first two elements are given. Uses the reverse Clenshaw algorithm,
stability is assumed.
"""
function clenshaw{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
    bnn = bn = zero(T)
    for k in length(c):-1:2
        bn, bnn = c[k] + α*bn + β*bnn, bn
    end
    f1*(c[1] + β*bnn) + f2*bn
end
``````

and I think my problem is what happens to `SSAValue(6)` below, but I don’t know what I can do about it:

``````julia
julia> @code_warntype clenshaw(rand(10), 1.0, -1.0, 1.0, 0.5)
Variables:
  #self#::#clenshaw
  c::Array{Float64,1}
  α::Float64
  β::Float64
  f1::Float64
  f2::Float64
  bn::Float64
  bnn::Float64
  #temp#::Int64
  k::Int64

Body:
  begin
      SSAValue(0) = (Base.box)(Float64,(Base.sitofp)(Float64,0))
      bn::Float64 = SSAValue(0)
      bnn::Float64 = SSAValue(0) # line 8:
      SSAValue(5) = (Base.arraylen)(c::Array{Float64,1})::Int64
      # meta: location range.jl colon 115
      # meta: location range.jl Type 70
      # meta: location range.jl Type 18
      SSAValue(6) = $(Expr(:invoke, LambdaInfo for steprange_last(::Int64, ::Int64, ::Int64), :(Base.steprange_last), SSAValue(5), -1, 2))
      # meta: pop location
      # meta: pop location
      # meta: pop location
      #temp#::Int64 = SSAValue(5)
      14:
      unless (Base.box)(Base.Bool,(Base.not_int)((Base.box)(Base.Bool,(Base.or_int)((Base.box)(Base.Bool,(Base.and_int)((Base.box)(Base.Bool,(Base.not_int)((SSAValue(5) === SSAValue(6))::Bool)),(Base.box)(Base.Bool,(Base.not_int)(((Base.slt_int)(0,-1)::Bool === (Base.slt_int)(SSAValue(5),SSAValue(6))::Bool)::Bool)))),(#temp#::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(6),-1)))::Bool)))) goto 27
      SSAValue(7) = #temp#::Int64
      k::Int64 = SSAValue(7)
      #temp#::Int64 = SSAValue(8) # line 9:
      SSAValue(4) = bn::Float64
      bn::Float64 = SSAValue(3)
      bnn::Float64 = SSAValue(4)
      25:
      goto 14
      27:  # line 11:
  end::Float64
``````

I would like to speed this up (it is called a lot in an inner loop of my code), but at the same time I would also like to understand what is going on with `SSAValue(6)`.

#2

FYI optimised clenshaw code:

#3

Thanks, and actually I am aware of that (I have been reading the `ApproxFun` codebase), but in this case my problem is more about learning how to deal with this than about the actual function. `clenshaw` was just an MWE.

#4

What makes you think there is a problem? It runs in 30 ns with 0 allocations. The `steprange_last` is just a function call that is not inlined.
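To make `SSAValue(6)` concrete: constructing the `StepRange` `length(c):-1:2` invokes `Base.steprange_last` once to precompute the last element actually reachable from the start with the given step, so that the loop can terminate with a cheap equality test. A small illustration (`steprange_last` is an internal function, so the details are version-dependent):

``````julia
r = 10:-1:2
# A StepRange stores first, step and last; the last element is computed
# up front from (start, step, stop) when the range is constructed.
(first(r), step(r), last(r))          # (10, -1, 2)
Base.steprange_last(10, -1, 2) == 2   # true
``````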

#5

I am not an expert, but I read that Clenshaw is better for Chebyshev evaluation. However, the naive Chebyshev evaluator below is about 2x as fast as `clenshaw`, so I thought there was a problem. MWE:

``````julia
"""
Calculate ``∑ₖ cₖ fₖ, k=1,...,n`` where ``fₖ₊₁ = α fₖ + β fₖ₋₁`` and
the first two elements are given. Uses the reverse Clenshaw algorithm,
stability is assumed.
"""
function clenshaw{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
    bnn = bn = zero(T)
    for k in length(c):-1:2
        bn, bnn = c[k] + α*bn + β*bnn, bn
    end
    f1*(c[1] + β*bnn) + f2*bn
end

"""
Evaluate the `θ`-weighted sum of the first `n` Chebyshev polynomials
at `x`.
"""
function evaluate_chebyshev{T}(n, θ::AbstractVector{T}, x::T)
    xp = x
    xpp = one(T)
    s = zero(x)
    for i in 1:n
        if i == 1
            s += xpp*θ[i]
        elseif i == 2
            s += xp*θ[i]
        else
            xp, xpp = 2*x*xp - xpp, xp
            s += xp*θ[i]
        end
    end
    s
end

using BenchmarkTools
θ = rand(10)
clenshaw(θ, 1.0, -1.0, 1.0, 0.5) ≈ evaluate_chebyshev(10, θ, 0.5)

@benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5) # ≈37ns
@benchmark evaluate_chebyshev(10, $θ, 0.5)   # ≈18ns
``````

Am I benchmarking incorrectly?
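For reference, here is how I convinced myself the two agree for general `x` and not just by accident at `x = 0.5` (`chebyshev_sum` is a throwaway direct evaluator, written just for this check):

``````julia
# Direct reference: build T₀(x), T₁(x), ... by the three-term recurrence
# and accumulate θ[k] * T_{k-1}(x) explicitly.
function chebyshev_sum(θ, x)
    tpp, tp = one(x), x            # T₀(x), T₁(x)
    s = θ[1]*tpp + θ[2]*tp
    for k in 3:length(θ)
        tp, tpp = 2x*tp - tpp, tp
        s += θ[k]*tp
    end
    s
end

x = 0.5
# Chebyshev corresponds to α = 2x, β = -1, f1 = T₀(x) = 1, f2 = T₁(x) = x,
# which for x = 0.5 gives exactly the (1.0, -1.0, 1.0, 0.5) used above.
clenshaw(θ, 2x, -1.0, 1.0, x) ≈ chebyshev_sum(θ, x)  # true
``````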

#6

I don’t know whether you are benchmarking incorrectly, but the following gives you another 10 ns improvement:

``````julia
julia> function clenshaw2{T}(c::AbstractVector{T}, α::T, β::T, f1::T, f2::T)
           bnn = bn = zero(T)
           l = length(c)
           for f in Base.OneTo(l-1)
               k = l+1-f
               bn, bnn = c[k] + α*bn + β*bnn, bn
           end
           f1*(c[1] + β*bnn) + f2*bn
       end

julia> clenshaw(θ, 1.0, -1.0, 1.0, 0.5)
0.3977626513630642

julia> clenshaw2(θ, 1.0, -1.0, 1.0, 0.5)
0.3977626513630642

julia> @benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5)  # ≈29.4ns
julia> @benchmark clenshaw2($θ, 1.0, -1.0, 1.0, 0.5) # ≈19.8ns
``````

When you use `@code_native` on both versions, you can see that the second now produces nicely vectorised code.
Non-standard loop patterns seem to irritate the optimiser from time to time and prevent it from recognising vectorisation opportunities.
(Hopefully I haven’t accidentally changed the number of iterations…)
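One quick way to settle the iteration-count worry: the rewritten loop visits exactly the same indices, just generated forwards from a `Base.OneTo` counter:

``````julia
l = 10
original  = collect(l:-1:2)                   # indices used by clenshaw
rewritten = [l+1-f for f in Base.OneTo(l-1)]  # indices used by clenshaw2
original == rewritten   # true: both are [10, 9, ..., 2]
``````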

And if you can make your function parameters compile-time constants and add a few `@inbounds`, you gain another 6.5 ns:

``````julia
julia> function clenshaw3{T}(c::AbstractVector{T})
           α, β, f1, f2 = 1.0, -1.0, 1.0, 0.5
           bnn = bn = zero(T)
           l = length(c)
           for f in Base.OneTo(l-1)
               k = l+1-f
               @inbounds bn, bnn = c[k] + α*bn + β*bnn, bn
           end
           @inbounds r = f1*(c[1] + β*bnn) + f2*bn
           return r
       end

julia> @benchmark clenshaw3($θ) # ≈13.3ns
``````

Note that the first optimisation above (changing the loop) only has a relevant effect on very small arrays, as in your case with length 10. The second (constant factors) has a significant effect for large arrays, since the repeated multiplications are optimised away, in particular those by 1.0.
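If hard-coding all four parameters is too restrictive, a middle ground for the Chebyshev case is to specialise only on the structure α = 2x, β = -1, f1 = 1, f2 = x and keep `x` as an argument. A sketch (`clenshaw_cheb` is a made-up name, and I haven’t benchmarked this variant):

``````julia
# Clenshaw specialised to the Chebyshev recurrence: β = -1 turns one
# multiplication into a subtraction, and f1 = 1 drops another one in the
# final combination step.
function clenshaw_cheb{T}(c::AbstractVector{T}, x::T)
    bnn = bn = zero(T)
    l = length(c)
    for f in Base.OneTo(l-1)
        k = l+1-f
        @inbounds bn, bnn = c[k] + 2x*bn - bnn, bn
    end
    @inbounds (c[1] - bnn) + x*bn
end
``````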

``````julia
julia> const θ = rand(32)

julia> @benchmark clenshaw($θ, 1.0, -1.0, 1.0, 0.5)  # ≈119ns
julia> @benchmark clenshaw2($θ, 1.0, -1.0, 1.0, 0.5) # ≈110ns
julia> @benchmark clenshaw3($θ)                      # ≈ 76ns
``````