JumpProblem with large number of jumps crashes

Hello,

I am trying to solve a JumpProblem defined on a network.

The problem consists of a SIR model in which I am trying to model every node of the network separately.

I have a working code that works with a small network (around 100 nodes), as I try to increase the size of nodes the code fails raising StackOverflow error.

See a minimal working version below:

using LightGraphs
using DifferentialEquations
using Plots
using Distributions
using SparseArrays
using LinearAlgebra

@info "Initializing system..."
degree = 5
kdist = Poisson(degree)
N = Int(100)
ks = 1
while sum(ks) % 2 != 0
    global ks = rand(kdist, N)
end

g = random_configuration_model(N, ks)

i₀ = rand(1:N, round(Int, N*0.1)) 
x₀ = [ones(N); zeros(N); zeros(N)]
x₀[i₀] .= 0
x₀[i₀ .+ N] .= 1
tspan = (0.0, 100.0)
p = [0.25, 0.05, degree, adjacency_matrix(g)]

function agg_sol(sol, N)
    agg = reshape(sol[:, :], (N, 3, :))
    agg = mean(agg; dims=1)
    agg = dropdims(agg; dims=1)
    agg = permutedims(agg, (2, 1))
    return agg
end


function dNᵢ(i)

    function rate(u, p, t)
        u[i]*(p[1]/p[3])*dot(p[4][i, :], u[(N+1):2N])
    end

    function affect!(integrator)
        integrator.u[i] -= 1
        integrator.u[N+i] += 1
    end

    return rate, affect!

end

function dMᵢ(i)

    function rate(u, p, t)
        u[N+i]*p[2]
    end

    function affect!(integrator)
        integrator.u[N+i] -= 1

        integrator.u[2N+i] += 1
    end

    return rate, affect!

end

@info "Collecting jumps..."
jumps = ConstantRateJump[]
for i in 1:N
    push!(jumps, ConstantRateJump(dNᵢ(i)...))
end

for i in 1:N
    push!(jumps, ConstantRateJump(dMᵢ(i)...))
end

@info "Building dependency graph..."
vtoj = Vector{Vector{Int64}}(undef, 3N)
for i in 1:N
    vtoj[i] = [i]
    vtoj[N+i] = [findnz(p[4][i, :])[1]; N+i]
    vtoj[2N+i] = []
end
vtoj

jtov = Vector{Vector{Int64}}(undef, length(jumps))
for i in 1:N
    jtov[i] = [i, N+i]
    jtov[N+i] = [N+i, 2N+i]
end
jtov

@info "Building discrete problem..."
discrete_prob = DiscreteProblem(x₀, tspan, p)

@info "Building jump problem..."
# see https://github.com/SciML/ModelingToolkit.jl/blob/f12f472f630fd85a6fab4ca547b1c679217c33df/src/systems/jumps/jumpsystem.jl
# see https://diffeq.sciml.ai/stable/types/jump_types/
jump_prob = JumpProblem(discrete_prob, RSSA(), JumpSet(jumps...);
    vartojumps_map=vtoj, jumptovars_map=jtov)

@info "Solving problem..."
jump_sol = solve(jump_prob)

@info "Plotting..."
plot(agg_sol(jump_sol, N));
ylims!(0, 1);
plot!(legend=:right);
savefig("test3.png")

@info "Done."

If I try to increase the number of nodes above to k=1000 the code crashes. The end of the error (as it is otherwise too big to reproduce here) is copied below. I am not sure why StackOverflowError is being raised since the JumpSet is reasonably small (only about 2000 functions).

My intuititon tells me that to simulate this model one only needs to update the rate of the whole model every time an event happens. Given that the dependency graph is provided, it should not be such an expensive operation or one that involves too many iterations.

The simulation could be slow if there are too many updates (as is the case here), but at least it should not crash after every update.

...
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
true_main at /buildworker/worker/package_linux64/build/src/jlapi.c:560
repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:702
main at julia (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4007d8)
ERROR: LoadError: StackOverflowError:
Stacktrace:
 [1] split_jumps(::Tuple{}, ::NTuple{983, ConstantRateJump{var"#rate#1"{Int64}, var"#affect!#2"{
Int64}}}, ::Nothing, ::Nothing, ::ConstantRateJump{var"#rate#1"{Int64}, var"#affect!#2"{Int64}},
 ::ConstantRateJump{var"#rate#1"{Int64}, var"#affect!#2"{Int64}}, ::Vararg{Any, N} where N) (rep
eats 984 times)
   @ DiffEqJump ~/.julia/packages/DiffEqJump/5mkeM/src/jumps.jl:127
 [2] JumpSet(::ConstantRateJump{var"#rate#1"{Int64}, var"#affect!#2"{Int64}}, ::Vararg{DiffEqJum
p.AbstractJump, N} where N)
   @ DiffEqJump ~/.julia/packages/DiffEqJump/5mkeM/src/jumps.jl:106
 [3] top-level scope
   @ ~/phd/qe/test3.jl:106
in expression starting at /home/gzagatti/phd/qe/test3.jl:106

split_jumps is recursive and uses splatting, so that could be the issue here. Can you open an issue on DiffEqJump.jl?

@isaacsas thanks for the help, I have raised the issue in Github.

Thanks! I’ll take a look later today.

Does this fix your issue? This avoids splatting by passing your array of ConstantRateJumps directly into the JumpSet:

jset = JumpSet(; constant_jumps=jumps)
jump_prob = JumpProblem(discrete_prob, RSSA(), jset;
    vartojumps_map=vtoj, jumptovars_map=jtov)

A couple other performance-related suggestions. First, you should use a single MassActionJump to encode all your dMᵢ; for many jumps it should offer a performance improvement. Second, in the rate function of dNᵢ you should use views for the array slices to avoid allocations. Also note you might consider transposing p[4] so you are using columns of it instead of rows to match how Julia stores matrices in memory. Finally, since you have a pure jump problem you should use the SSAStepper time-stepper:

jump_sol = solve(jump_prob, SSAStepper())
1 Like

I am really thankful for your help. The code now not only runs without crashing but with your suggestions runs 30x faster than with the old implementation — a considerable improvement.

I have implemented your suggestions as below (with the updated API).

using LightGraphs
using DifferentialEquations
using Plots
using Distributions
using SparseArrays
using LinearAlgebra
using BenchmarkTools

@info "Initializing system..."
degree = 5
kdist = Poisson(degree)
N = Int(1000)
ks = 1
while sum(ks) % 2 != 0
    global ks = rand(kdist, N)
end

g = random_configuration_model(N, ks)

i₀ = rand(1:N, round(Int, N*0.1)) 
x₀ = [ones(N); zeros(N); zeros(N)]
x₀[i₀] .= 0
x₀[i₀ .+ N] .= 1
tspan = (0.0, 100.0)
p = [0.25, 0.05, degree, adjacency_matrix(g)]

function agg_sol(sol, N)
    agg = reshape(sol[:, :], (N, 3, :))
    agg = mean(agg; dims=1)
    agg = dropdims(agg; dims=1)
    agg = permutedims(agg, (2, 1))
    return agg
end

@info "Collecting jumps..."
function dNᵢ(i)

    @views function rate(u, p, t)
        @inbounds u[i]*(p[1]/p[3])*dot(p[4][:, i], u[(N+1):2N])
    end

    function affect!(integrator)
        integrator.u[i] -= 1
        integrator.u[N+i] += 1
    end

    return rate, affect!

end

infections = ConstantRateJump[]
for i in 1:N
    push!(infections, ConstantRateJump(dNᵢ(i)...))
end

recoveries_reactant = Vector{Pair{Int, Int}}[]
recoveries_net = Vector{Pair{Int, Int}}[]
for i in 1:N
    push!(recoveries_reactant, [N+i => 1])
    push!(recoveries_net, [N+i => -1, 2N+i => +1])
end
recoveries = MassActionJump(repeat([p[2]], N), recoveries_reactant, recoveries_net; scale_rates=false)

@info "Building dependency graph..."
vtoj = Vector{Vector{Int64}}(undef, 3N)
for i in 1:N
    @inbounds vtoj[i] = [N+i]
    @inbounds vtoj[N+i] = [N .+ findnz(p[4][:, i])[1]; i]
    @inbounds vtoj[2N+i] = []
end

jtov = Vector{Vector{Int64}}(undef, 2N)
@views for i in 1:N
    @inbounds jtov[i] = [N+i, 2N+i]
    @inbounds jtov[N+i] = [i, N+i]
end

jtoj = Vector{Vector{Int64}}(undef, 2N)
for i in 1:N
    jtoj[i] = [N .+ findnz(p[4][i, :])[1]; N + i; i]
    jtoj[N+i] = [N .+ findnz(p[4][i, :])[1]; N + i]
end

@info "Building discrete problem..."
discrete_prob = DiscreteProblem(x₀, tspan, p)

@info "Building jump problem..."
# see https://github.com/SciML/ModelingToolkit.jl/blob/f12f472f630fd85a6fab4ca547b1c679217c33df/src/systems/jumps/jumpsystem.jl
# see https://diffeq.sciml.ai/stable/types/jump_types/
js = JumpSet(; constant_jumps=infections, massaction_jumps=recoveries)
# jump_prob = JumpProblem(discrete_prob, RSSA(), js; vartojumps_map=vtoj, jumptovars_map=jtov)
jump_prob = JumpProblem(discrete_prob, DirectCR(), js; dep_graph=jtoj)

@info "Solving problem..."
@btime solve(jump_prob, SSAStepper())
jump_sol = solve(jump_prob, SSAStepper())

@info "Plotting..."
plot(agg_sol(jump_sol, N));
ylims!(0, 1);
plot!(legend=:right);
savefig("test5.png")

@info "Done."

With the previous implementation (without MassActionJump and the views), I got the following performance with 1,000 nodes:

182.444 ms (201195 allocations: 172.55 MiB)

With the new implementation, I got the following performance :

6.797 ms (92786 allocations: 24.99 MiB)

I can even run with 10,000 nodes in under 1 seconds and 2.2Gb allocation. (With 100,00 the program crashes, likely because I don’t have enough memory).

With regards to performance, I was not sure whether @views and @inbounds are redundant or I need both. Would you be able to tell which one is correct?

Do you see any other low hanging fruits for improving performance?

Finally, I would like to share that my pain point was figuring out the order of the jumps to build the dependency graphs. It is nowhere documented in which order the mass and constant jumps are put together in the JumpSet. The structure itself also does not have any method that gives me the mapping between jumps and their indices.

The way I got around this was to use Catalyst.jl to build an equivalent model and use the functions asgraph, variable_dependencies and eqeq_dependencies to pierce those mappings together. (I am happy to share the Catalyst.jl code for reference). It would be great to have an easier way to do this or perhaps improve the documentation with a discussion on that.

Generally I would think Catalyst is just the easiest way to do this. It automates a lot of the little optimizations, and it would be hard to automate them in other ways (and possibly not worth the work, given that Catalyst exists)

We still need work on JumpSystem memory use / conversion performance in ModelingToolkit, which could become an issue at large enough system sizes. So probably this is the right approach for now.

@gzagatti if you want to help out, a PR to the docs at GitHub - SciML/DiffEqDocs.jl: Documentation for the DiffEq differential equations and scientific machine learning (SciML) ecosystem clarifying how constant rate and mass action jumps are internally ordered would be super helpful for other users!

What happens to your memory use if you don’t save the state with every single jump? (i.e. in JumpProblem pass the kwarg save_positions=(false,false) and then use saveat=some_time_step in solve.) I think you should be able to get to more than 10K jumps.

Also, I would suggest playing around with different SSAs for your problem; it might be that DirectCR, NRM or RSSACR work better than RSSA with your graph structure.

I tried different algorithms, and actually DirectCR was the best one between RSSA and DirectCR by a substantial margin. The results I reported are for DirectCR.

@ChrisRackauckas, my first approach was actually with Catalyst.jl as I thought it would save me a lot of work. Unfortunately, it did not scale. Problem set up with as little as 100 nodes was really slow, and it crashed when solved. For future reference this is the script:

using LightGraphs
using DifferentialEquations
using Plots
using GraphPlot
using LinearAlgebra
using Distributions
using Catalyst
using ModelingToolkit

@info "Initializing system..."
degree = 5
kdist = Poisson(degree)
N = Int(20)
ks = 1
while sum(ks) % 2 != 0
    global ks = rand(kdist, N)
end

g = random_configuration_model(N, ks)

i₀ = rand(1:N, round(Int, N*0.1)) 
x₀ = [ones(N); zeros(N); zeros(N)]
x₀[i₀] .= 0
x₀[i₀ .+ N] .= 1
tspan = (0.0, 100.0)
p = [0.25, 0.05, degree, adjacency_matrix(g)]

function agg_sol(sol, N)
    agg = reshape(@view(sol[:, :]), (N, 3, :))
    agg = mean(agg; dims=1)
    agg = dropdims(agg; dims=1)
    agg = permutedims(agg, (2, 1))
    return agg
end

@info "Initializing variables..."
@parameters t
@variables β γ k A[1:N, 1:N] x[1:3N](t)

@info "Collecting reactions..."
rx = []
for i in 1:N
    push!(rx, Reaction(
            x[i]*(β/k)*dot(A[:, i], x[(N+1):2N]), 
            [x[i]], 
            [x[N+i]];
            only_use_rate=true)
    )
end

for i in 1:N
    push!(rx, Reaction(
            γ,
            [x[N+i]],
            [x[2N+i]];
            only_use_rate=false)
    )
end

@info "Building reaction system..."
rs = ReactionSystem(rx, t, x, reduce(vcat, [[β, γ, k], vec(A)]))

@info "Converting to jump system..."
jump_sys = convert(JumpSystem, rs; combinatoric_ratelaws=false)

@info "Building discrete problem..."
discrete_prob = DiscreteProblem(
    jump_sys, 
    Pair.(x, x₀), 
    tspan,
    Pair.(
        reduce(vcat, [[β, γ, k], vec(A)]), 
        reduce(vcat, [p[1:3], vec(adjacency_matrix(g))])
    )
)

@info "Building jump problem..."
jump_prob = JumpProblem(jump_sys, discrete_prob, DirectCR())

@info "Solving problem..."
jump_sol = solve(jump_prob, SSAStepper())

With that being said, Catalyst.jl was really useful in model exploration. It helped to figure out how to describe the problem in DiffEqJump.jl. The equations output and the Graphviz plots convinced me that I had set up the problem correctly.

@isaacsas I will try to help with the documentation. It is good practice to get more acquainted with the code base. Will get back to it from Monday.

Open an issue with this example. It would be good to work on.

Thanks for the doc help!

And please do open an issue if you could describing the Catalyst problems. I’m guessing they are related to the relatively long rate laws you’ve got for the first type of reactions, but it would be good for us to understand where it is slow.

I have opened the issue in Github.

1 Like