For the curious: 90% of excess inference time in my code can be reduced to an (over|ab)use of SArrays. Here is an MWE (all latest package versions, Julia 1.6):
using StaticArrays, ForwardDiff, LinearAlgebra, BenchmarkTools
# SETUP
_dual(x, ::Val{N}) where N = ForwardDiff.Dual(x, ntuple(_ -> x, Val(N)))
K = 30 # too large for a static size to pay off
N = 6
T = Float64
s = _dual.(rand(SVector{K,T}), Val(N)); # static vector of Dual numbers
v = Vector(s);                          # the same data as a plain Vector
V = rand(T, K);
# time in a fresh session --- it is mostly compilation time
@time dot(V, s) # mixed case, the worst to infer
@time dot(V, v) # easiest to infer, runs comparably fast
@btime dot($V, $s)
@btime dot($V, $v)
Output
julia> @time dot(V, s) # mixed case, the worst to infer
0.229272 seconds (515.34 k allocations: 31.705 MiB, 5.10% gc time, 99.92% compilation time)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)
julia> @time dot(V, v) # easiest to infer, fastest to run
0.027791 seconds (39.30 k allocations: 2.403 MiB, 97.29% compilation time)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)
julia> @btime dot($V, $s)
40.695 ns (0 allocations: 0 bytes)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)
julia> @btime dot($V, $v)
46.788 ns (0 allocations: 0 bytes)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)
Background
The code is for estimating an economic model with a 3D state space using a Smolyak approximation. But most of the tests use a 2D state space and a lower-level Smolyak approximation (K=8 instead of 30–60).
Because it was a runtime improvement and the tests did not display any significant inference lag, we wrote it using SVectors, often nested 2–3 levels. Then expanding the (static) dimension and adding AD on top of it really taxes the compiler, for little tangible benefit: the actual problem is like the MWE above, but compounded by various type combinations.
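In case it helps someone in the same situation, here is a minimal sketch of the kind of workaround this suggests (the helper name `to_dynamic` is hypothetical, not from my actual code): convert static containers to plain Arrays at the boundary where K gets large, so the generic Vector methods get compiled once, independent of K.

```julia
using LinearAlgebra

# Hypothetical helper (illustration only): recursively replace array-like
# containers with plain Arrays. Since SVector <: AbstractArray, the second
# method also picks up (nested) static arrays once StaticArrays is loaded;
# scalars, including ForwardDiff.Dual numbers, pass through untouched.
to_dynamic(x) = x
to_dynamic(x::AbstractArray) = map(to_dynamic, Array(x))

# Stand-in data (plain arrays here, so the sketch runs without StaticArrays):
V = rand(3)
nested = [rand(3) for _ in 1:2]      # plays the role of nested SVectors
to_dynamic(nested)                   # Vector{Vector{Float64}}
dot(V, to_dynamic(V))                # same value, Vector code path
```

With something like this, `dot(V, to_dynamic(s))` in the MWE above would hit the same easy-to-infer method as `dot(V, v)`.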
Thanks!
Kudos to @timholy and the others who contributed the tooling for introspecting these issues. Learning it took about an afternoon of initial investment (for the first round; I am sure there is more to learn), but it really paid off. The workflow is well documented in recent PRs to SnoopCompile and is comparable to runtime profiling. Having this available in Julia makes the language even more awesome.
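For anyone who wants to reproduce the introspection workflow, a minimal sketch, assuming SnoopCompile v2 on Julia 1.6 (the macro and function names below come from SnoopCompile's documented API and may differ in other versions; check the package docs):

```julia
using SnoopCompile, StaticArrays, ForwardDiff, LinearAlgebra

# Run in a fresh session, so nothing relevant is compiled yet.
s = ForwardDiff.Dual.(rand(SVector{30,Float64}), 1.0)
V = rand(30)

# Profile inference itself while compiling the mixed-case dot:
tinf = @snoopi_deep dot(V, s)

# Displaying tinf summarizes total time and the fraction spent in inference;
# inference_triggers lists the runtime dispatches that forced fresh inference.
itrigs = inference_triggers(tinf)

# With a flame-graph viewer loaded (e.g. ProfileView), flamegraph(tinf)
# gives the usual visual breakdown.
```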