I’m working on an application that optimizes a complex ODE by iteratively optimizing over the linearized system. The optimization works fine, but sensitivity analysis of the ODE has very strange performance behaviour. Currently, I’m doing this by applying ForwardDiff to the solution of a DifferentialEquations ODEProblem. Once compilation has happened, Julia’s very fast, but issues arise on the first run. The program can be found here.
On the first run, the program can take more than 90 seconds to compile on fast hardware (3 GHz i7). This only happens when trying to differentiate the integrator (the ODE itself is fast to compile), and once compiled, performance is very good. Analysis of compilation time with SnoopCompile indicates some abysmally long compile times for specific functions, with the following taking more than 7 seconds:
Base.Math.muladd(typeof(Base.muladd), StaticArrays.SArray{Tuple{14}, ForwardDiff.Dual{ForwardDiff.Tag{getfield(Dynamics, Symbol("#f#7")){Main.ProbInfo, Float64}, Float64}, Float64, 21}, 1, 14}, ForwardDiff.Dual{ForwardDiff.Tag{getfield(Dynamics, Symbol("#f#7")){Main.ProbInfo, Float64}, Float64}, Float64, 21}, StaticArrays.SArray{Tuple{14}, ForwardDiff.Dual{ForwardDiff.Tag{getfield(Dynamics, Symbol("#f#7")){Main.ProbInfo, Float64}, Float64}, Float64, 21}, 1, 14})
I’ve also tried straight profiling, but the program fails to terminate when running under the profiler.
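For anyone who wants to reproduce the pattern without the full program, here is a minimal sketch of the setup described above: ForwardDiff applied to the final state of an ODE solve over an SVector state, timed on the first and second call. The two-state `decay` dynamics, the `final_state` helper, and the parameter values are hypothetical stand-ins for the real 14-state system, and I load OrdinaryDiffEq (the solver package behind DifferentialEquations) directly to keep the example small.

```julia
using OrdinaryDiffEq, ForwardDiff, StaticArrays

# Hypothetical toy dynamics standing in for the real 14-state system.
# Out-of-place form: returns a new SVector instead of mutating a buffer.
decay(u, p, t) = SVector(-p[1] * u[1], p[1] * u[1] - p[2] * u[2])

# Final state at t = 1 as a function of the parameters p.
function final_state(p)
    # Give the state the element type of p so it can hold Dual numbers.
    u0 = SVector{2, eltype(p)}(1.0, 0.0)
    prob = ODEProblem(decay, u0, (0.0, 1.0), p)
    sol = solve(prob, Tsit5(); save_everystep = false)
    return sol.u[end]
end

p = [0.5, 0.25]
t_first  = @elapsed ForwardDiff.jacobian(final_state, p)  # includes compilation
t_second = @elapsed ForwardDiff.jacobian(final_state, p)  # already compiled
println("first call: ", t_first, " s, second call: ", t_second, " s")
```

On a toy problem this small the gap will be far less dramatic than the 90 seconds above, since the Dual numbers here carry 2 partials rather than 21 and the state has 2 entries rather than 14, but the structure of the call is the same.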
The long compile time makes testing the program unpleasant, and it would be nice if there were a way to redesign it so that it compiles faster without sacrificing runtime performance. I’ve had a few ideas:
- Stop using SArray, which should make things easier for the inliner and constant propagator at the expense of runtime speed. I could probably recover the performance through very careful use of in-place modification.
- Use DifferentialEquations’ sensitivity analysis instead of ForwardDiff on a normal ODE. However, the function being integrated is not amenable to symbolic differentiation in the general case, and the documentation isn’t entirely clear on how to set this up purely numerically.
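To make the first idea concrete, here is a rough sketch of what the in-place variant might look like on a toy problem: the dynamics write into a preallocated `du` and the state is a plain `Vector` rather than an SArray. The names `decay!` and `final_state_inplace` and the two-state dynamics are hypothetical, and I haven't established whether this actually reduces compile time for the real program.

```julia
using OrdinaryDiffEq, ForwardDiff

# Hypothetical in-place toy dynamics: writes the derivative into du
# instead of constructing and returning a static array.
function decay!(du, u, p, t)
    du[1] = -p[1] * u[1]
    du[2] = p[1] * u[1] - p[2] * u[2]
    return nothing
end

# Final state at t = 1 as a function of the parameters p.
function final_state_inplace(p)
    # The state vector takes the element type of p so that
    # ForwardDiff's Dual numbers can be stored in it.
    u0 = eltype(p)[1.0, 0.0]
    prob = ODEProblem(decay!, u0, (0.0, 1.0), p)
    sol = solve(prob, Tsit5(); save_everystep = false)
    return sol.u[end]
end

J = ForwardDiff.jacobian(final_state_inplace, [0.5, 0.25])
```

The trade-off is the one mentioned in the bullet: operations on a `Vector` are no longer unrolled per element the way SArray operations are, so there should be less code to specialize for each Dual type, but the solver works with heap-allocated buffers at runtime.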
I feel like I’m running into a particularly degenerate case with this program, and suspect that there’s something minor I could change to solve this problem. Is there any easy way to improve the compile performance? Precompilation would work for the deployed version of the program, but does not solve the development-time compile performance problem.