# How to maintain performance despite type instability when you know the type?

#1

I have an example of type instability. My goal is to evaluate a function on a grid of points, but the dimensionality of the grid is determined at run time by the rank of the Hessian matrix.

The grid and various StaticArrays are then parameterized by the dimensionality.
Here is a simple example without any parametrization:

``````julia> function construct_random_svector1(p::Int)
x = SVector{p}(Vector{Float64}(p))
end
construct_random_svector1 (generic function with 1 method)

julia> function mult_rand1(p::Int)
x = construct_random_svector1(p)
y = construct_random_svector1(p)
dot(x,y)
end
mult_rand1 (generic function with 1 method)

julia> function svw1(p::Int)
rand_prod = mult_rand1(p)
end
svw1 (generic function with 1 method)

julia> @code_warntype svw1(5)
Variables:
#self#::#svw1
p::Int64
rand_prod::Any
x@_4::Any
y::Any
x@_6::Any
x@_7::Any

Body:
begin
$(Expr(:inbounds, false))
# meta: location REPL[6] mult_rand 2
# meta: location REPL[5] construct_random_svector 2
SSAValue(1) = ((Core.apply_type)(Main.SVector, p::Int64)::Type{StaticArrays.SArray{Tuple{_},T,1,_} where T} where _)($(Expr(:invoke, MethodInstance for randn!(::MersenneTwister, ::Array{Float64,1}), :(Base.Random.randn!), :(Base.Random.GLOBAL_RNG), :($(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :(p), 0))))))::Any
# meta: pop location # line 3:
# meta: location REPL[5] construct_random_svector 2
SSAValue(3) = ((Core.apply_type)(Main.SVector, p::Int64)::Type{StaticArrays.SArray{Tuple{_},T,1,_} where T} where _)($(Expr(:invoke, MethodInstance for randn!(::MersenneTwister, ::Array{Float64,1}), :(Base.Random.randn!), :(Base.Random.GLOBAL_RNG), :($(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :(p), 0))))))::Any
# meta: pop location
# meta: pop location
$(Expr(:inbounds, :pop))
SSAValue(0) = (Main.dot)(SSAValue(1), SSAValue(3))::Any
return SSAValue(0)
end::Any
``````

We can tidy up the `@code_warntype` output by adding type annotations:

``````julia> function mult_rand2(p::Int)
x = construct_random_svector1(p)::SVector{p,Float64}
y = construct_random_svector1(p)::SVector{p,Float64}
dot(x,y)::Float64
end
mult_rand2 (generic function with 1 method)

julia> function svw2(p::Int)
rand_prod = mult_rand2(p)
end
svw2 (generic function with 1 method)

julia> @code_warntype svw2(5)
Variables:
#self#::#svw2
p::Int64
rand_prod::Float64

Body:
begin
SSAValue(0) = $(Expr(:invoke, MethodInstance for mult_rand2(::Int64), :(Main.mult_rand2), :(p)))
return SSAValue(0)
end::Float64
``````

But this actually worsens performance:

``````julia> using BenchmarkTools

julia> a = 5;

julia> @benchmark svw1($a)
BenchmarkTools.Trial:
memory estimate:  1.86 KiB
allocs estimate:  29
--------------
minimum time:     2.624 μs (0.00% GC)
median time:      2.713 μs (0.00% GC)
mean time:        3.037 μs (8.31% GC)
maximum time:     404.043 μs (97.26% GC)
--------------
samples:          10000
evals/sample:     9

julia> @benchmark svw2($a)
BenchmarkTools.Trial:
memory estimate:  3.61 KiB
allocs estimate:  57
--------------
minimum time:     5.842 μs (0.00% GC)
median time:      5.949 μs (0.00% GC)
mean time:        6.628 μs (8.02% GC)
maximum time:     679.427 μs (97.31% GC)
--------------
samples:          10000
evals/sample:     6
``````

If p were known ahead of time, we could make it known to the compiler:

``````julia> struct num{p}
end

julia> b = num{a}
num{5}

julia> function construct_random_svector3(::Type{num{p}}) where {p}
x = SVector{p}(Vector{Float64}(p))
x
end
construct_random_svector3 (generic function with 1 method)

julia> function mult_rand3(::Type{num{p}}) where {p}
x = construct_random_svector3(num{p})
y = construct_random_svector3(num{p})
dot(x,y)
end
mult_rand3 (generic function with 1 method)

julia> function svw3(::Type{num{p}}) where {p}
rand_prod = mult_rand3(num{p})
end
svw3 (generic function with 1 method)

julia> @code_warntype svw3(b)
Variables:
#self#::#svw3
#unused#::Any
rand_prod::Float64

Body:
begin
SSAValue(0) = $(Expr(:invoke, MethodInstance for mult_rand3(::Type{num{5}}), :(Main.mult_rand3), num{5}))
return SSAValue(0)
end::Float64

julia> @benchmark svw3($b)
BenchmarkTools.Trial:
memory estimate:  272 bytes
allocs estimate:  3
--------------
minimum time:     134.391 ns (0.00% GC)
median time:      137.253 ns (0.00% GC)
mean time:        151.221 ns (5.48% GC)
maximum time:     1.434 μs (81.61% GC)
--------------
samples:          10000
evals/sample:     888

``````

So far, the best solution I have been able to come up with is

``````julia> function mult_rand4(::Type{num{p}}) where {p}
x = construct_random_svector3(num{p})
y = construct_random_svector3(num{p})
dot(x,y)::Float64
end
mult_rand4 (generic function with 1 method)

julia> function svw4(p::Int)
rand_prod = mult_rand4(num{p})
end
svw4 (generic function with 1 method)

julia> @code_warntype svw4(a)
Variables:
#self#::#svw4
p::Int64
rand_prod::Float64

Body:
begin
SSAValue(0) = (Main.mult_rand4)((Core.apply_type)(Main.num, p::Int64)::Type{num{_}} where _)::Float64
return SSAValue(0)
end::Float64

julia> @benchmark svw4($a)
BenchmarkTools.Trial:
memory estimate:  272 bytes
allocs estimate:  3
--------------
minimum time:     424.704 ns (0.00% GC)
median time:      435.721 ns (0.00% GC)
mean time:        466.005 ns (3.25% GC)
maximum time:     7.972 μs (88.91% GC)
--------------
samples:          10000
evals/sample:     199
``````

This comes nowhere near the performance of being parametrized in the first place.

Are there workarounds I am missing?

EDIT:
Also worth pointing out:

``````julia> function construct_random_vector5(p::Int)
x = Vector{Float64}(p)
x
end
construct_random_vector5 (generic function with 1 method)

julia> function mult_rand5(p::Int)
x = construct_random_vector5(p)
y = construct_random_vector5(p)
dot(x,y)
end
mult_rand5 (generic function with 1 method)

julia> function svw5(p::Int)
rand_prod = mult_rand5(p)
end
svw5 (generic function with 1 method)

julia> @code_warntype svw5(a)
Variables:
#self#::#svw5
p::Int64
rand_prod::Float64
x@_4::Any
y::Any
x@_6::Array{Float64,1}
x@_7::Array{Float64,1}

Body:
begin
$(Expr(:inbounds, false))
# meta: location REPL[4] mult_rand5 2
# meta: location REPL[3] construct_random_vector5 2
x@_6::Array{Float64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :(p), 0))
# meta: pop location # line 3:
# meta: location REPL[3] construct_random_vector5 2
x@_7::Array{Float64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :(p), 0))
# meta: pop location
# meta: pop location
$(Expr(:inbounds, :pop))
SSAValue(0) = $(Expr(:invoke, MethodInstance for dot(::Array{Float64,1}, ::Array{Float64,1}), :(Main.dot), :(x@_6), :(x@_7)))
return SSAValue(0)
end::Float64

julia> @benchmark svw5($a)
BenchmarkTools.Trial:
memory estimate:  256 bytes
allocs estimate:  2
--------------
minimum time:     74.900 ns (0.00% GC)
median time:      78.476 ns (0.00% GC)
mean time:        84.285 ns (4.32% GC)
maximum time:     631.964 ns (77.53% GC)
--------------
samples:          10000
evals/sample:     973
``````

That is, it is worth weighing the cost/benefit of StaticArrays vs. type instability (vs. throwing an error when the Hessian isn’t full rank).

#2

In general, you don’t want to do dynamic dispatch in your innermost loops. Determine the dimensionality `p` once, in an “outer” function `outer`, and then dispatch to a function `inner`, parameterized by `p`, that does all of your core computations. (That is, the expensive loops are in `inner`, not in `outer`.)
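The outer/inner split described here might be sketched like this (a hypothetical example: the names `outer` and `inner` come from the paragraph above, and an `NTuple` stands in for an `SVector` so the snippet is self-contained):

```julia
# `inner` is compiled separately for each Val{p}, so inside it the
# dimensionality is a compile-time constant and everything is type-stable.
function inner(::Val{p}) where {p}
    x = ntuple(_ -> randn(), p)   # isbits NTuple: no heap allocation
    y = ntuple(_ -> randn(), p)
    s = 0.0
    for i in 1:p
        s += x[i] * y[i]          # the expensive work belongs here
    end
    return s
end

# `outer` determines p at run time; the call into `inner` is the single
# dynamic dispatch, paid once per call rather than once per loop iteration.
outer(p::Int) = inner(Val(p))
```

With this arrangement the dispatch cost is amortized over the whole body of `inner`, which is the point about placing the barrier around expensive work.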

Your `mult_rand` function is too cheap, for example: you don’t want dynamic dispatch to occur at this point.

Note also that your `construct_random_svector3` calls `randn(p)`, which allocates an array on the heap, killing most of the advantage of using `SVector`. Don’t do heap allocation in your innermost performance-critical functions!

I assume that your `construct_random_svector1` function is supposed to call `randn(p)` like in `construct_random_svector3`? `Vector{Float64}(p)` allocates an uninitialized array, and comparing it to computations that call `randn` is meaningless.

Your `mult_rand` doesn’t actually require any allocation or vectors at all: if you just did `svw5(p) = sum(i -> randn() * randn(), 1:p)` then it would compute exactly the same thing (statistically speaking) with no allocations (and with less code!). It’s important to unlearn the “lesson” taught by Matlab, NumPy, R, etcetera that “vectorizing” is always a good idea:

``````julia> @btime svw1(5);
470.303 ns (5 allocations: 368 bytes)

julia> @btime svw2(5);
6.921 μs (57 allocations: 3.61 KiB)

julia> @btime svw3($(num{5}));
292.894 ns (3 allocations: 272 bytes)

julia> @btime svw4(5);
291.219 ns (3 allocations: 272 bytes)

julia> @btime svw5(5);
100.656 ns (0 allocations: 0 bytes)
``````

(And the time goes down even further to 70ns if I replace the `sum` call with an explicit `for i = 1:p` loop.)
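The explicit-loop variant mentioned above might look like this (a sketch; `svw_loop` is a made-up name):

```julia
# Same computation as the sum(...) one-liner, written as a plain loop:
# no vectors, no allocations, just scalar randn() calls.
function svw_loop(p::Int)
    s = 0.0
    for i in 1:p
        s += randn() * randn()
    end
    return s
end
```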

#3

Probably I’m going to tell you something that you already know: the usual approach to this problem is to set up a “function barrier.” The entry point is a non-parameterized function `f` that doesn’t know `p` in advance. It constructs the fixed-size vector based on `p`, which is type-unstable. But then it calls another function `g`, parameterized by `p`, in which the fixed-size vector is an argument. All the computation is carried out by `g`. This function `g` has no type instability and runs at full speed.
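A minimal sketch of that f/g barrier (hypothetical code: the names `f` and `g` are from the description above, and an `NTuple` plays the role of the fixed-size vector so the example needs no packages):

```julia
# g is parameterized by p; it is type-stable and does all the computation.
g(v::NTuple{p,Float64}) where {p} = sum(abs2, v)

# f is the entry point: constructing v from a run-time p is type-unstable,
# but the instability stops at the call to g (the function barrier).
function f(p::Int)
    v = ntuple(_ -> randn(), p)
    return g(v)
end
```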

#4

That’s what his `svw4` function is doing, but the problem is that it is too cheap. You want to set up the function barrier around an expensive function, i.e. at a higher/earlier point in the computation, so that the cost of the dynamic dispatch is negligible compared to the cost of the function execution.

#5

Steven, yes, I agree that the OP has a function barrier, and that it isn’t helping performance because the “g” part doesn’t have enough computation relative to the “f” part. Your other post (which crossed in time with mine) makes this very clear.

The other point to make about the OP’s code is that if a type instability is created because a parameter of a parameterized type is unknown at compile time, then type annotation is not helpful. The reason is that the type annotation itself refers to a type that cannot be pinned down at compile time, so the type-check operation has to be dispatched at run time, and knowledge gained from the annotation cannot be used at compile time to specialize the subsequent code.
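A small illustration of this point (a hypothetical sketch using an `NTuple` in place of an `SVector`): the asserted type below is built from the run-time value `p`, so the check itself must be dispatched at run time and gives the compiler nothing to specialize on.

```julia
function annotated(p::Int)
    # NTuple{p,Float64} is constructed at run time from the value of p,
    # so this typeassert is a dynamic check, not compile-time information.
    v = ntuple(_ -> 0.0, p)::NTuple{p,Float64}
    return sum(v)
end
```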

#6

Oops on the randn()!
Originally I was using randn() in my testing, before switching to Vector just before posting. I missed that one when going back and editing my examples.

My code doesn’t really look like “svw”, and it takes at least many milliseconds to run, possibly seconds, depending on both `p` and the function used. I wanted something simple, reflecting layered functions, to play around with in deciding how to deal with / where to place the dynamic dispatch. `svw4` followed the “function barrier” approach, which given what’s been said seems clearly best.
As pointed out, the reason it did not look performant was simply that the inner constructor was too fast in comparison. Compared to the “many milliseconds at least”, a hundred or so nanoseconds mean little.

Also, it now makes sense why the type annotations are (worse than) useless despite cleaning up the `@code_warntype` output. It seems like I ought to have been able to reason that out a priori.

EDIT:
Benchmarking suggests the performance benefits of static arrays more than make up for the cost of a dynamic dispatch. A clear case where allowing for ambiguities is a boon.