IDA is allocating a lot

I am using the IDA solver from Sundials.jl to solve a system of DAEs. It works fine, but each time step performs a lot of allocations, even though my callback function does not allocate at all.

MWE:

using KiteModels

kps4::KPS4 = KPS4(KCU(se()))

integrator = KiteModels.init_sim!(kps4; stiffness_factor=0.035, prn=false)
@timev next_step!(kps4, integrator, v_ro = 0, dt=0.05)
nothing

Output:

julia> include("mwes/mwe_02.jl")
  0.003074 seconds (16.70 k allocations: 1.230 MiB)
elapsed time (ns):  3073747
gc time (ns):       0
bytes allocated:    1290000
pool allocs:        16703
non-pool GC allocs: 0
minor collections:  0
full collections:   0

This is for one high-level time step of 50 ms, which needs 50 to 1500 callback calls to find a solution.
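
(Back-of-the-envelope: 16.7 k allocations spread over 50 to 1500 callback calls would be roughly 10 to 330 allocations per call, averaging well under 100 bytes each, i.e. many small temporary objects.)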

Any idea how to reduce the allocations?

I know I should use ModelingToolkit, but rewriting my model for ModelingToolkit would be a huge task. Is there anything else I can do to improve the speed or reduce the allocations?

You can find the function init_sim!() here: KiteModels.jl/src/KiteModels.jl at 0abe18f6de4b6ba62df145c13423b05584c25817 · ufechner7/KiteModels.jl · GitHub

and the function next_step!() here: KiteModels.jl/src/KiteModels.jl at 0abe18f6de4b6ba62df145c13423b05584c25817 · ufechner7/KiteModels.jl · GitHub

The state vector has 61 elements of type Float64.

The residual function looks like this:

"""
    residual!(res, yd, y::MVector{S, SimFloat}, s::KPS4, time) where S

    N-point tether model, four points for the kite on top.
    Inputs:
    State vector y   = pos1,  pos2, ..., posn,  vel1,  vel2, ..., veln,  length, v_reel_out
    Derivative   yd  = posd1, posd2, ..., posdn, veld1, veld2, ..., veldn, lengthd, v_reel_outd
    Output:
    Residual     res = res1, res2 = vel1-posd1, ..., veld1-acc1, ...

    Additional parameters:
    s: struct with work variables, type KPS4
    S: the dimension of the state vector
    The number of point masses of the model is N = S/6; the state of each point
    is represented by two 3-element vectors.
"""
function residual!(res, yd, y::Vector{SimFloat}, s::KPS4, time)
    S = length(y)
    y_ =  MVector{S, SimFloat}(y)
    yd_ =  MVector{S, SimFloat}(yd)
    residual!(res, yd_, y_, s, time)
end
function residual!(res, yd, y::MVector{S, SimFloat}, s::KPS4, time) where S
...
end
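
A side note on this wrapper: constructing the two MVectors copies the whole state on every residual call, and because they escape into the inner method they are heap-allocated. A quick toy check of just that conversion (not the KiteModels code; the size is chosen to roughly match the state vector):

using StaticArrays

y = rand(62)                      # stand-in for the state vector
to_mvec(y) = MVector{62, Float64}(y)
to_mvec(y)                        # compile first
@allocated to_mvec(y)             # ~500 bytes: the 62 Float64s are copied into a fresh heap object

With two such conversions per residual call and 50 to 1500 calls per step, this alone can account for a noticeable share of the reported 1.23 MiB.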

Can you share an output of the allocation profiler?

I have never used it, so I first need to find out how to use it…

First result:

Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
 ╎7 @Base/client.jl:552; _start()
 ╎ 7 @Base/client.jl:333; exec_options(opts::Base.JLOptions)
 ╎  7 @Base/client.jl:416; run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
 ╎   7 @Base/essentials.jl:889; invokelatest
 ╎    7 @Base/essentials.jl:892; #invokelatest#2
 ╎     7 @Base/client.jl:432; (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
 ╎    ╎ 7 …-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375; run_repl(repl::REPL.AbstractREPL, consumer::Any)
 ╎    ╎  7 …-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend…
 ╎    ╎   7 …dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:228; kwcall(::NamedTuple, ::typeof(REPL.start_repl_backend), backend::REPL.REPLBackend, cons…
 ╎    ╎    7 …dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231; start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
 ╎    ╎     7 …ot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246; repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
 ╎    ╎    ╎ 7 …ot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150; eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
 ╎    ╎    ╎  7 @Base/boot.jl:385; eval
 ╎    ╎    ╎   7 @Base/client.jl:489; include(fname::String)
 ╎    ╎    ╎    7 @Base/loading.jl:2136; _include(mapexpr::Function, mod::Module, _path::String)
 ╎    ╎    ╎     7 @Base/loading.jl:2076; include_string(mapexpr::typeof(identity), mod::Module, code::String, filename::String)
2╎    ╎    ╎    ╎ 7 @Base/boot.jl:385; eval
 ╎    ╎    ╎    ╎  1 @Base/compiler/typeinfer.jl:1078; typeinf_ext_toplevel(mi::Core.MethodInstance, world::UInt64)
 ╎    ╎    ╎    ╎   1 @Base/compiler/typeinfer.jl:1082; typeinf_ext_toplevel(interp::Core.Compiler.NativeInterpreter, linfo::Core.MethodInstance)
 ╎    ╎    ╎    ╎    1 @Base/compiler/typeinfer.jl:1051; typeinf_ext(interp::Core.Compiler.NativeInterpreter, mi::Core.MethodInstance)
 ╎    ╎    ╎    ╎     1 @Base/compiler/typeinfer.jl:216; typeinf(interp::Core.Compiler.NativeInterpreter, frame::Core.Compiler.InferenceState)
 ╎    ╎    ╎    ╎    ╎ 1 @Base/compiler/typeinfer.jl:272; _typeinf(interp::Core.Compiler.NativeInterpreter, frame::Core.Compiler.InferenceState)
 ╎    ╎    ╎    ╎    ╎  1 @Base/compiler/optimize.jl:453; optimize
 ╎    ╎    ╎    ╎    ╎   1 @Base/compiler/optimize.jl:504; run_passes
 ╎    ╎    ╎    ╎    ╎    1 @Base/compiler/optimize.jl:489; run_passes(ci::Core.CodeInfo, sv::Core.Compiler.OptimizationState{Core.Compiler.NativeInterpreter}…
 ╎    ╎    ╎    ╎    ╎     1 @Base/compiler/optimize.jl:624; slot2reg(ir::Core.Compiler.IRCode, ci::Core.CodeInfo, sv::Core.Compiler.OptimizationState{Core.Co…
1╎    ╎    ╎    ╎    ╎    ╎ 1 @Base/compiler/ssair/slot2ssa.jl:18; scan_entry!(result::Vector{Core.Compiler.SlotInfo}, idx::Int64, stmt::Any)
 ╎    ╎    ╎    ╎  4 @KiteModels/src/KiteModels.jl:478; kwcall(::@NamedTuple{v_ro::Int64, dt::Float64}, ::typeof(next_step!), s::KPS4{Float64, StaticArraysCor…
 ╎    ╎    ╎    ╎   4 @KiteModels/src/KiteModels.jl:483; next_step!(s::KPS4{Float64, StaticArraysCore.MVector{3, Float64}, 11, 15, KiteModels.Spring{Int16, Fl…
 ╎    ╎    ╎    ╎    4 @SciMLBase/src/integrator_interface.jl:907; step!(integ::Sundials.IDAIntegrator{1, KPS4{Float64, StaticArraysCore.MVector{3, Float64}, …
 ╎    ╎    ╎    ╎     4 …ndials/src/common_interface/integrator_types.jl:247; step!
 ╎    ╎    ╎    ╎    ╎ 4 @Sundials/src/common_interface/solve.jl:1363; solver_step(integrator::Sundials.IDAIntegrator{1, KPS4{Float64, StaticArraysCore.MVecto…
1╎    ╎    ╎    ╎    ╎  4 @Sundials/lib/libsundials_api.jl:3500; IDASolve
 ╎    ╎    ╎    ╎    ╎   3 @Sundials/src/common_interface/function_types.jl:93; idasolfun(t::Float64, u::Ptr{Sundials._generic_N_Vector}, du::Ptr{Sundials._ge…
 ╎    ╎    ╎    ╎    ╎    3 @SciMLBase/src/scimlfunctions.jl:2207; DAEFunction
 ╎    ╎    ╎    ╎    ╎     1 @KiteModels/src/KPS4.jl:395; residual!(res::Vector{Float64}, yd::Vector{Float64}, y::Vector{Float64}, s::KPS4{Float64, StaticArra…
1╎    ╎    ╎    ╎    ╎     1 @KiteModels/src/KPS4.jl:396; residual!(res::Vector{Float64}, yd::Vector{Float64}, y::Vector{Float64}, s::KPS4{Float64, StaticArra…
 ╎    ╎    ╎    ╎    ╎     1 @KiteModels/src/KPS4.jl:397; residual!(res::Vector{Float64}, yd::Vector{Float64}, y::Vector{Float64}, s::KPS4{Float64, StaticArra…
 ╎    ╎    ╎    ╎    ╎    ╎ 1 @KiteModels/src/KPS4.jl:438; residual!(res::Vector{Float64}, yd::StaticArraysCore.MVector{62, Float64}, y::StaticArraysCore.MVec…
1╎    ╎    ╎    ╎    ╎    ╎  1 @Base/array.jl:1021; setindex!
Total snapshots: 8. Utilization: 100% across all threads and tasks. Use the `groupby` kwarg to break down by thread and/or task.

A flamegraph would be much easier to read. StatProfilerHTML.jl graphs are a nice way to share this.

I used this code:

using KiteModels, Profile, StatProfilerHTML

kps4::KPS4 = KPS4(KCU(se()))

integrator = KiteModels.init_sim!(kps4; stiffness_factor=0.035, prn=false)
@timev next_step!(kps4, integrator, v_ro = 0, dt=0.05)
@profilehtml next_step!(kps4, integrator, v_ro = 0, dt=0.05)
nothing

But I do not see the allocations…
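
For what it is worth: to see allocations rather than time samples, one option (assuming Julia 1.8 or newer; PProf.jl is used here as the viewer and is an extra dependency) is the dedicated allocation profiler in Profile.Allocs:

using KiteModels, Profile, PProf

kps4::KPS4 = KPS4(KCU(se()))
integrator = KiteModels.init_sim!(kps4; stiffness_factor=0.035, prn=false)
next_step!(kps4, integrator, v_ro=0, dt=0.05)   # warm up, so compilation is not profiled

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 next_step!(kps4, integrator, v_ro=0, dt=0.05)
PProf.Allocs.pprof()                            # opens an interactive flamegraph of allocation sites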

You get a whole folder?

Yes. I cannot upload it, not allowed…

I just pushed it to git: KiteModels.jl/statprof.zip at main · ufechner7/KiteModels.jl · GitHub

I added an updated version: KiteModels.jl/statprof2.zip at main · ufechner7/KiteModels.jl · GitHub

based on this code:

using KiteModels, Profile, StatProfilerHTML

kps4::KPS4 = KPS4(KCU(se()))

integrator = KiteModels.init_sim!(kps4; stiffness_factor=0.035, prn=false)
@timev next_step!(kps4, integrator, v_ro = 0, dt=0.05)
function nsteps(kps4, integrator)
    for i=1:10
        next_step!(kps4, integrator, v_ro = 0, dt=0.05)
    end
end
nsteps(kps4, integrator)
Profile.clear()
@profilehtml nsteps(kps4, integrator)
nothing

What is the difference between inclusive and exclusive sample count?

One more question: All of my equations are differential equations, there are no algebraic equations. Does this mean I could use another solver?

One more simple example:

using Sundials, ControlPlots

G_EARTH  = [0.0, 0.0, -9.81] # gravitational acceleration
  
# Example one: Falling mass
# State vector y   = mass.pos, mass.vel
# Derivative   yd  = mass.vel, mass.acc
# Residual     res = (y.vel - yd.vel), (yd.acc - G_EARTH)     
function res!(res, yd, y, p, time)
    res[1:3] .= y[4:6] .- yd[1:3]
    res[4:6] .= yd[4:6] .- G_EARTH 
end

function solve(integrator, dt, t_final)
    pos_z=Float64[]
    vel_z=Float64[]
    time=Float64[]
    for t in 0:0.05:t_final    
        push!(time, t)
        Sundials.step!(integrator, dt, true)
        push!(pos_z, integrator.u[3])
        push!(vel_z, integrator.u[6])
    end
    time, pos_z, vel_z
end

# Set the initial conditions
t_final = 10.0              # Final simulation time
vel_0 = [0.0, 0.0, 50.0]    # Initial velocity
pos_0 = [0.0, 0.0,  0.0]    # Initial position
acc_0 = [0.0, 0.0, -9.81]   # Initial acceleration
y0 = append!(pos_0, vel_0)  # Initial pos, vel
yd0 = append!(vel_0, acc_0) # Initial vel, acc

differential_vars = ones(Bool, length(y0))
solver  = IDA(linear_solver=:GMRES, max_order = 4)
tspan   = (0.0, t_final) 
abstol  = 0.0006 # max error in m/s and m
s = nothing

prob    = DAEProblem(res!, yd0, y0, tspan, s, differential_vars=differential_vars)
integrator = Sundials.init(prob, solver; abstol=abstol, reltol=0.001)

dt = 0.05

@time time_, pos_z, vel_z = solve(integrator, dt, t_final)

# plot the result
plt.ax1 = plt.subplot(111) 
plt.ax1.set_xlabel("time [s]")
plt.plot(time_, pos_z, color="green")
plt.ax1.set_ylabel("pos_z [m]")  
plt.ax1.grid(true) 
plt.ax2 = plt.twinx()  
plt.ax2.set_ylabel("vel_z [m/s]")   
plt.plot(time_, vel_z, color="red")

@time time_, pos_z, vel_z = solve(integrator, dt, 2*t_final)
nothing

If I run this I get (second run):

  0.001271 seconds (10.12 k allocations: 533.000 KiB)

Why so many allocations for such a simple problem? Is there a solver that can solve this more cheaply?

Or, in other words: are there other solvers in DifferentialEquations.jl that can handle implicit differential equations that are NOT DAEs?

In your first code, it seems the allocations are coming from your norm function?

In the second code, G_EARTH is a non-constant global.

You are allocating the slices here; try using views.
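
For example (a self-contained toy check, not your exact residual; BenchmarkTools is only used to count the allocations):

using BenchmarkTools

res = zeros(6); y = rand(6); yd = rand(6)
f_slice!(res, yd, y) = (res[1:3] .= y[4:6] .- yd[1:3]; nothing)          # y[4:6] and yd[1:3] allocate temporaries
f_view!(res, yd, y)  = (@views res[1:3] .= y[4:6] .- yd[1:3]; nothing)   # views reuse the existing memory

@btime f_slice!($res, $yd, $y)   # expect 2 allocations per call
@btime f_view!($res, $yd, $y)    # expect 0 allocations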

I applied your suggestions:

using Sundials

const G_EARTH  = [0.0, 0.0, -9.81] # gravitational acceleration
const dt = 0.05
const t_final = 10.0
  
# Example one: Falling mass
# State vector y   = mass.pos, mass.vel
# Derivative   yd  = mass.vel, mass.acc
# Residual     res = (y.vel - yd.vel), (yd.acc - G_EARTH)     
function res!(res, yd, y, p, time)
    @views res[1:3] .= y[4:6] .- yd[1:3]
    @views res[4:6] .= yd[4:6] .- G_EARTH 
end

struct Result
    time::Vector{Float64}
    pos_z::Vector{Float64}
    vel_z::Vector{Float64}
end
function Result(t_final)
    n=Int64(round(t_final/dt+1))
    Result(zeros(n), zeros(n), zeros(n))
end

function solve!(res, integrator, dt, t_final)
    for (i,t) in pairs(0:dt:t_final)
        res.time[i] = t
        Sundials.step!(integrator, dt, true)
        res.pos_z[i] = integrator.u[3]
        res.vel_z[i] = integrator.u[6]
    end
    nothing
end

function init()
    vel_0 = [0.0, 0.0, 50.0]    # Initial velocity
    pos_0 = [0.0, 0.0,  0.0]    # Initial position
    acc_0 = [0.0, 0.0, -9.81]   # Initial acceleration
    y0 = append!(pos_0, vel_0)  # Initial pos, vel
    yd0 = append!(vel_0, acc_0) # Initial vel, acc

    differential_vars = ones(Bool, length(y0))
    solver  = IDA(linear_solver=:GMRES, max_order = 4)
    tspan   = (0.0, t_final) 
    abstol  = 0.0006 # max error in m/s and m
    s = nothing

    prob    = DAEProblem(res!, yd0, y0, tspan, s, differential_vars=differential_vars)
    integrator = Sundials.init(prob, solver; abstol=abstol, reltol=0.001)
end

integrator=init()
res=Result(2*t_final)
@time solve!(res, integrator, dt, t_final)
bytes = @allocated solve!(res, integrator, dt, 2*t_final)
n=Int64(round(t_final/dt+1))
println("Allocated $(Int64(round(bytes/n))) bytes per iteration!")

This prints:

julia> include("mwes/mwe_04.jl")
  0.096097 seconds (174.18 k allocations: 11.615 MiB, 99.74% compilation time)
Allocated 1254 bytes per iteration!

So I still get 1254 bytes of allocation for each 50 ms simulation step.

I will try alternative solutions.

GMRES would allocate, I think, unless Sundials caches the Krylov subspace.

If you don’t have any algebraic equations, you should definitely be using a different solver, probably FBDF.
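
For the falling-mass MWE above, that could look roughly like this (a sketch only; the kite model itself would first need its residual rewritten in explicit ODE form):

using OrdinaryDiffEq

const G_EARTH = [0.0, 0.0, -9.81]   # gravitational acceleration

# Falling mass written as an explicit ODE instead of a residual:
# u = [pos; vel],  du = [vel; acc]
function rhs!(du, u, p, t)
    du[1:3] .= @view u[4:6]
    du[4:6] .= G_EARTH
    nothing
end

u0   = [0.0, 0.0, 0.0, 0.0, 0.0, 50.0]           # initial pos and vel
prob = ODEProblem(rhs!, u0, (0.0, 10.0))
integrator = init(prob, FBDF(); abstol=0.0006, reltol=0.001, save_everystep=false)

for _ in 1:200
    step!(integrator, 0.05, true)                # advance by exactly 50 ms, like Sundials.step!
end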

Instead of speculating, we should just run the allocation profiler again. Note that it only tracks Julia allocations that the GC is aware of, so the Krylov ones won’t actually show up.

I think the issue is that he has written algebraic expressions of the derivative terms. The obvious answer there is to use ModelingToolkit as that would also improve the numerics, since you generally never want to solve in the implicit DAE form if you don’t have to, but OP gave extra constraints for fun :sweat_smile:. But yes, in a serious project you’d probably just fix that if you’re worried about robustness and performance.

I tested a lot of different methods, and so far IDA is the best.

  • IDA using step!() function: 938 bytes per iteration

Source code: KiteModels.jl/mwes/mwe_04.jl at main · ufechner7/KiteModels.jl · GitHub

A different approach, calling solve() on each iteration:

  • IDA: 15206 bytes
  • DFBDF: 30124 bytes
  • DImplicitEuler: 27589 bytes
  • DABDF2: 27855 bytes
  • daskr: 37059 bytes

Source code: KiteModels.jl/mwes/mwe_07.jl at main · ufechner7/KiteModels.jl · GitHub

Is the amount of allocations the only thing that matters?

Well, I want to have a (soft) real-time simulation. Using IDA I can turn off the garbage collector and run the simulation for about 200 s of real time without going out of memory.
I am considering buying another 32 GB RAM stick so that I could run the simulation for 400 s … But I just think there must be a better solution.

Of course performance also matters, and fewer iterations also cause fewer allocations.

I also considered using a preconditioner, but did not find a starting point for doing this…

If this is a simulation, does it matter if there is a GC pause every 100 seconds or so?

Does it matter? Well, the video starts to stutter when it starts allocating… You can see here the simulation time per 50 ms time step… I turn the GC back on when I have less than 2 GB of free RAM…
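
Roughly, the strategy looks like this (a minimal sketch, not my actual code; kps4 and integrator are from the MWE at the top, and Sys.free_memory() is just one way to check the free RAM):

n_steps = 4000                             # e.g. 200 s of simulation at 50 ms per step
GC.enable(false)                           # no GC pauses during the real-time loop
for i in 1:n_steps
    next_step!(kps4, integrator, v_ro=0, dt=0.05)
    if Sys.free_memory() < 2 * 1024^3      # less than 2 GB of free RAM left
        GC.enable(true)                    # let the GC clean up once
        GC.gc()
        GC.enable(false)
    end
end
GC.enable(true)                            # back to normal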

As long as I do just simulations and no real-time flight control it is just annoying, no real problem… But at some point I would like to do real-time flight control…

My callback function has zero allocations, and I would like to have a solver that does not allocate… There is this open issue: Implicit Radau5DAE · Issue #1406 · SciML/OrdinaryDiffEq.jl · GitHub

I think I could make this solver non-allocating, but so far I am not familiar enough with the abstract interfaces to port it to Julia… A Python and a Fortran version exist…

I mean, a real-time capable GC would also help… I asked for it once and was told I should ask again when I have at least 1 million in funding… I don’t understand why it is so difficult to implement a GC call that takes a time-out as a parameter and stops cleaning up when its time budget is used up, but again, I am not a GC expert…
