Interpolation of callbacks is slow and blows up memory in DifferentialEquations.jl

I’ve found that the interpolation in the Callbacks interface of DifferentialEquations.jl can drastically slow down a solve and blow up memory allocations.

To demonstrate, I’ll use the cosine/sine example from the Callbacks documentation, modified so that the state vector `u` is five times as long (10 components instead of 2) to exacerbate the slowdown.

using DifferentialEquations, StaticArrays, BenchmarkTools

function fun(u, p, t)
    @SVector [u[2], -u[1], u[4], -u[3], u[6], -u[5], u[8], -u[7], u[10], -u[9]]
end

tspan = (0.0,1000.0)
u0 = @SVector [1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]
prob = ODEProblem(fun,u0,tspan)

Benchmarking shows that this ODE without any callbacks takes about 1ms to complete.

@benchmark sol = solve(prob, DP5(), save_everystep=false)

> BenchmarkTools.Trial: 
>   memory estimate:  128.73 KiB
>   allocs estimate:  2290
>   --------------
>   minimum time:     488.000 μs (0.00% GC)
>   median time:      1.054 ms (0.00% GC)
>   mean time:        1.096 ms (3.09% GC)
>   maximum time:     8.769 ms (87.27% GC)
>   --------------
>   samples:          4524
>   evals/sample:     1

Next, I solve the ODE with a simple callback. I use a dummy condition that always returns -1. Since the condition never crosses zero, the callback never fires, and little to no extra processing should be needed. However, the memory allocations blow up ~1000x and the runtime is ~80x slower!

condition(u,t,integrator) = -1.
affect!(integrator) = terminate!(integrator)
cb = ContinuousCallback(condition, affect!, save_positions=(false, false))

@benchmark sol = solve(prob, DP5(), save_everystep=false, callback=cb)

> BenchmarkTools.Trial: 
>   memory estimate:  138.09 MiB
>   allocs estimate:  2685005
>   --------------
>   minimum time:     57.506 ms (38.04% GC)
>   median time:      82.732 ms (34.48% GC)
>   mean time:        83.229 ms (35.62% GC)
>   maximum time:     110.638 ms (24.88% GC)
>   --------------
>   samples:          61
>   evals/sample:     1

I believe the performance bottleneck is the `interp_points` parameter of the callback, which defaults to 10. My guess is that at each evaluated time point, 10 new arrays (each of length 10) are being allocated for the interpolated values; repeated over many steps, this becomes a major bottleneck.
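As I understand it, `interp_points` controls how many interior samples of the dense interpolant are checked for a sign change of the condition on each step. A rough conceptual sketch (my simplification only — the real implementation differs, and I’ve dropped the `integrator` argument from the condition):

```julia
# Conceptual sketch of the per-step event scan; `interp` stands in for the
# solver's dense interpolant on [t, t + dt].
function has_sign_change(condition, interp, t, dt, interp_points)
    ts = range(t, t + dt; length = interp_points + 1)
    prev = condition(interp(ts[1]), ts[1])
    for i in 2:length(ts)
        cur = condition(interp(ts[i]), ts[i])
        prev * cur < 0 && return true   # event bracketed between two samples
        prev = cur
    end
    return false
end
```

With a condition that always returns -1, no sign change is ever found, but every sample still evaluates the interpolant — and if each evaluation allocates a fresh array, the allocations add up.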

If I set `interp_points` to 0, then the solver becomes fast again.

cb = ContinuousCallback(condition, affect!, save_positions=(false, false), interp_points=0)

@benchmark sol = solve(prob, DP5(), save_everystep=false, callback=cb)
> BenchmarkTools.Trial: 
>   memory estimate:  128.22 KiB
>   allocs estimate:  2285
>   --------------
>   minimum time:     506.000 μs (0.00% GC)
>   median time:      1.125 ms (0.00% GC)
>   mean time:        1.141 ms (2.89% GC)
>   maximum time:     9.762 ms (84.72% GC)
>   --------------
>   samples:          4345
>   evals/sample:     1

Question: How can I make the callback run faster while still having interpolation? Would it be possible to change the Callback implementation so that it writes to the same memory every time (rather than allocating new memory)? Perhaps each instance of a ContinuousCallback struct could store a field, say `interp_u`, that is reused on every step.
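Something like this hypothetical sketch (`CachedInterpCallback` and `interp_u` are made-up names for illustration, not actual DiffEqBase internals):

```julia
# Hypothetical sketch only -- not the actual ContinuousCallback implementation.
# The idea: preallocate a buffer of interpolated values once and overwrite it
# in place each step, instead of allocating fresh arrays.
struct CachedInterpCallback{C,A,T}
    condition::C
    affect!::A
    interp_points::Int
    interp_u::Vector{T}   # reusable buffer, refilled in place every step
end

CachedInterpCallback(condition, affect!, interp_points, ::Type{T}) where {T} =
    CachedInterpCallback(condition, affect!, interp_points,
                         Vector{T}(undef, interp_points))

# On each step the solver would fill cb.interp_u via the dense interpolant,
# then scan it for sign changes of `condition`, with no new allocations.
```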

It shouldn’t be allocating, since it’s using SVectors. It sounds like using static arrays introduces a type instability in the callback code. Could you open an issue? @YingboMa or I can take a closer look. There are some issues with SVectors that we had to work around in other parts of the code (specifically “v4.18.1 introduces major regression for out-of-place SVector integration” · Issue #571 · SciML/OrdinaryDiffEq.jl), and it sounds like we need to make similar changes to the callback code.

And for reference, here’s the StaticArrays performance bug that is likely being hit: “Very slow broadcast” · Issue #560 · JuliaArrays/StaticArrays.jl

So to answer the other parts of the question:

With static arrays you don’t have to do that, because allocating a static array is not supposed to hit the heap at all — that’s exactly what this issue is about. When you use the in-place function form f(du,u,p,t), the solver does keep cached arrays so that the same memory is reused every step (but of course you can’t, and don’t want to, do that with static arrays). So this is just a matter of optimizing the static-array code path.
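For reference, the in-place form of the oscillator above would look like this (a sketch using a plain Vector, since in-place mutation doesn’t apply to SVectors):

```julia
# In-place right-hand side: writes into the preallocated du instead of
# returning a new array, so the solver can reuse its internal caches.
function fun!(du, u, p, t)
    @inbounds for i in 1:2:length(u)
        du[i]   =  u[i+1]   # x' = y
        du[i+1] = -u[i]     # y' = -x
    end
    return nothing
end

u0 = [1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]
# prob = ODEProblem(fun!, u0, (0.0, 1000.0))  # solved the same way as before
```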


Thanks for looking at this and the explanation! I’ve created an issue here


Is that true even if the u is huge (say > 10000)?

Yes, we are talking about stack-allocated static arrays. But of course, if you made a static array of that size, your code would never finish compiling and it would overflow the stack, making it slower than you’d expect — if it works at all.
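To illustrate the stack-vs-heap point without any package dependencies (a rough sketch, with an NTuple standing in for an SVector):

```julia
# Sketch: small immutable values (like SVector) live on the stack, while a
# Vector result must be heap-allocated. An NTuple stands in for SVector here
# so this example needs no packages.
double_tuple(x::NTuple{10,Float64}) = map(y -> 2y, x)  # new tuple, no heap
double_vec(x::Vector{Float64}) = 2 .* x                # new heap-allocated Vector

function allocation_demo()
    t = ntuple(i -> Float64(i), 10)
    v = collect(1.0:10.0)
    double_tuple(t); double_vec(v)   # warm up (compile) before measuring
    return (@allocated(double_tuple(t)), @allocated(double_vec(v)))
end
```

After warm-up, the tuple version reports zero bytes allocated while the Vector version must allocate its result on every call.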