First solve of `ODEProblem` and running out of memory: compilation problems?

Hey it’s me again,
I keep encountering memory-related issues with the first call to solve(prob), where prob is an ODEProblem (see details below). While this is probably expected, as it is most likely related to the “time to first plot”-problem, it becomes problematic when the first call actually crashes due to running out-of-memory (OOM), or when the first call takes an extraordinary amount of time.
In my specific case, this is sadly exactly what happens when trying to simulate (relatively) large systems of coupled ODEs. In some recent posts, I have already been investigating some issues related to saving, callbacks, EnsembleProblems, and more, yet this underlying issue has not yet been resolved for me. Sadly, the systems I am interested in are of sizes for which a first call to solve(...) does not resolve as I appear to be running OOM.

To illustrate my problem, as always, I consider the generalized disordered Lotka-Volterra model:

\frac{dx_i}{dt} = x_i \big( 1 - x_i - \sum_{j \neq i} A_{ij} x_j \big)

where entries of the interaction matrix A are random. Code is as follows, using ModelingToolkit.jl and DifferentialEquations.jl;

using DifferentialEquations
using ModelingToolkit
using ModelingToolkit: t_nounits as t, D_nounits as D

using Random
using SparseArrays

# ODESystem
function create_odesystem(S::Int)
    @variables (x(t))[1:S]
    @parameters A[1:S,1:S]

    eqns = [D(x[i]) ~ x[i]*(1 - sum(A[i,:].*x)) for i in 1:S]
    @named sys = ODESystem(eqns, t, [x...], [A])
    return complete(sys)
end

# ODEProblem
function create_odeproblem(
    sys::ODESystem;
    c=0.5, tspan=(0.0,1e3), rng=Random.Xoshiro(1234)
)
    S = length(sys.x)
    #/ Create parameters
    Av = sprandn(rng, S, S, c) ./ sqrt(S)
    for i in 1:S
        Av[i,i] = 1.0
    end
    xv = rand(rng, S)
    pmap = [sys.A => Av]
    xmap = [sys.x => xv]
    #/ Create ODEProblem
    prob = ODEProblem(sys, xmap, tspan, pmap)
    return prob
end

# Solve ODEProblem
function solve_odeproblem(prob::ODEProblem; save=false)
    return solve(
        prob,
        AutoTsit5(TRBDF2()),
        callback=PositiveDomain(save=false),
        save_everystep=save,
        save_start=save
    )
end

Then (note that include(...) statements are excluded for brevity):

julia> S = 16;
julia> sys = create_odesystem(S);
julia> prob = create_odeproblem(sys);

julia> @time sol = solve_odeproblem(prob);
  4.444015 seconds (20.86 M allocations: 1.066 GiB, 7.36% gc time)

julia> @time sol = solve_odeproblem(prob);
  0.007990 seconds (168.19 k allocations: 9.213 MiB)

Note the large no. of allocations and memory usage on the first call. In this case, there’s not a big problem, of course; just do it once, and subsequent calls are very fast. This raises the question whether using PackageCompiler.jl to make a sysimage of DifferentialEquations (or perhaps OrdinaryDiffEq) may be worth it — yet trying it out I did not see any improvements (so far).

The problem is bigger in larger systems

julia> S = 100;
julia> sys = create_odesystem(S);
julia> prob = create_odeproblem(sys);

julia> @time sol = solve_odeproblem(prob);
  8.313252 seconds (42.05 M allocations: 2.202 GiB, 6.66% gc time, 96.34% compilation time)

For my current projects I intend to investigate systems like the one above, but with additional processes/interactions, yet these often need to be investigated relatively large systems, which with my current implementation I simply cannot do.

Note that I am also interested in running many distinct instances, so any potential solutions should also apply to EnsembleProblems.

Questions

  • What is the main reason that the first call to solve takes so much memory? In addition, subsequent calls with a different S also appear to take up quite some time/memory, so I cannot simply call it for, say, S=1, and afterwards investigate a larger system.
  • Is there a way to overcome the memory overhead of the first call somehow? Can I exploit things like PackageCompiler.jl, or other implementations of the code above to drastically reduce the overhead of the first call?

All advice, help and tips are greatly appreciated. Thanks a lot in advance!

I don’t really know ModelingToolkit.jl, but if you’re willing to forgo it, there are certainly ways of improving both the compilation and run-time performance of your code. Here is an example:

using DifferentialEquations
using Random
using SparseArrays
using LinearAlgebra
using PreallocationTools

function lotka_volterra_ode!(dx, x, p, t)

    inter_spec_comp = PreallocationTools.get_tmp(p.cache, x)
    mul!(inter_spec_comp, p.A, x)

    @. dx = x*(1 - inter_spec_comp)
    
    return nothing

end

function make_parameters(S; c = 0.5, rng = Random.Xoshiro(1234))
    A = sprandn(rng, S, S, c) ./ sqrt(S)
    for i ∈ axes(A, 1)
        A[i,i] = 1.0
    end
    x_temp = zeros(S)
    cache = PreallocationTools.DiffCache(x_temp)
    p = (; A, cache)
    return p
end

function solve_odeproblem(prob::ODEProblem; save=false)
    return solve(
        prob,
        AutoTsit5(TRBDF2()),
        callback=PositiveDomain(save=false),
        save_everystep=save,
        save_start=save
    )
end
## --
S = 16
t_end = 1e3

p = make_parameters(S)
x0 = rand(Random.Xoshiro(1234), S)

ode_prob = ODEProblem{true}(lotka_volterra_ode!, x0, (0.0, t_end), p)
## --
@time sol = solve_odeproblem(ode_prob)
@time sol = solve_odeproblem(ode_prob)

The initial allocation is still pretty big (I got ~15 M allocations, ~750 MiB), but it no longer scales with system size S. I think this is because in your code, the RHS function of the ODE allocates quite a lot (A[i,:] and A[i,:].*xboth allocate temporary arrays, and as the system size increases these allocations get larger).

Additionally, while this in and of itself doesn’t affect allocations, the way you’re multiplying a matrix with a vector in pieces by summing over a product of matrix rows and the vector is also not ideal for performance.

I’m not sure what best practices are for writing non-allocating ModelingToolkit code is, hopefully someone else can chime in.

Above, I’ve made the RHS non-allocating by using the mul! function from the LinearAlgebra standard library in conjunction with PreallocationTools.jl, which is a SciML package for managing autodiff-friendly caches.