Distributed ODE solving

Hi all,

I am interested in being able to manually solve ODEs in a distributed way, as below. I am aware of EnsembleDistributed(), however I want greater control over the outcome of each trajectory. My thinking is to run the serial solve inside a for loop which I distribute, since I do not care about the order of the outputs.
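For reference, the standard ensemble route I am trying to move away from looks something like this (a minimal sketch; the prob_func that remakes p per trajectory is just illustrative):

using Distributed
addprocs(4)
@everywhere using OrdinaryDiffEq

@everywhere f(u, p, t) = p[1] * u
prob = ODEProblem(f, 1.0, (0.0, 1.0), [1.5])

# Draw a new parameter for each trajectory
prob_func(prob, i, repeat) = remake(prob, p = [rand()])

ensemble_prob = EnsembleProblem(prob, prob_func = prob_func)
sol = solve(ensemble_prob, Tsit5(), EnsembleDistributed(), trajectories = 100)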

My actual problem uses CellMLToolkit to load a model and formulate the ODEProblem that way. My aim is to solve across 2 nodes, each with 64 cores. Are there any tips for doing things this way? The cluster admins are fairly militant, so I thought I had better check first. MWE and Slurm script below.

CODE

using Distributed
addprocs(128)

@everywhere using OrdinaryDiffEq, QuasiMonteCarlo, JLD, Statistics, SharedArrays  # Statistics for mean; SharedArrays for the shared results array

@everywhere function f(du, u, p, t)
    du[1] = p[1] * u[1] - p[2] * u[1] * u[2] #prey
    du[2] = -p[3] * u[2] + p[4] * u[1] * u[2] #predator
end

@everywhere u0 = [1.0; 1.0]
@everywhere tspan = (0.0, 10.0)
@everywhere p = [1.5, 1.0, 3.0, 1.0]
@everywhere prob = ODEProblem(f, u0, tspan, p)
@everywhere t = collect(range(0, stop = 10, length = 200))

@everywhere function f_S(p)
    prob_ = remake(prob, p = p)
    sol = solve(prob_, Tsit5(), saveat = t)
    [mean(sol[1,:]); mean(sol[2,:])]
end

# Generate parameter samples
@everywhere samples = 50000
@everywhere lb = [1.0, 1.0, 1.0, 1.0]
@everywhere ub = [5.0, 5.0, 5.0, 5.0]
@everywhere sampler = SobolSample()
@everywhere A, B = QuasiMonteCarlo.generate_design_matrices(samples, lb, ub, sampler)

# Analytical indices set up #
@everywhere N = 50000  # samples
@everywhere D = 4    # parameters
@everywhere M = 2    # outputs

# Initialize results as a SharedArray so writes from the workers are visible on the master
y_AB = SharedArray{Float64}(M, N, D)

# A_B matrix evaluation; @sync blocks until all distributed iterations finish
@sync @distributed for i in 1:N
    for j in 1:D
        AB = copy(A[:,i])
        AB[j] = B[j,i]
        y_AB[:,i,j] = f_S(AB)
    end
end

save("y_AB.jld", "data", convert(Array, y_AB))

Slurm Script

#!/bin/bash
#SBATCH --job-name=Dis
#SBATCH --time=96:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --output="Log.txt"
#SBATCH --error="err.txt"

/users/ac1hsax/julia-1.10.5/bin/julia --threads 128 CourtDis.jl

I’m not entirely sure what the question is, but if you are doing distributed then you don’t need to set threads, and you probably want to set BLAS threads to 1.
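Something like this on every process (assuming the default OpenBLAS backend):

using Distributed
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)  # one BLAS thread per worker process

Otherwise each worker’s BLAS spawns its own threads and they all fight over the same cores.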


There was no real question; I was more looking for advice on best practices.

Searching around, I could not find many examples of distributed computing for ODE systems outside of EnsembleDistributed.

The docs say that solves taking over 1 millisecond are suitable for this. Are there any benchmarks on these sorts of problems?

Some of the older adaptive SDE benchmarks test that.

I have been playing around with a more complex example over 4 nodes with 64 cores each. However, I see no speed-up compared to running on a single node with 64 threads. Is there anything obvious that I am missing?

using Distributed
using ClusterManagers
using SharedArrays

# Initialize workers. Note plain addprocs(n) spawns all n workers on the local
# node; spanning nodes under Slurm needs e.g. SlurmManager from ClusterManagers.
addprocs(256)

@everywhere begin
   using OrdinaryDiffEq, ModelingToolkit, CellMLToolkit, QuasiMonteCarlo, JLD, LinearAlgebra
   ml = CellModel("courtemanche_ramirez_nattel_1998.cellml");
   prob = ODEProblem(ml, (0.0,2000.0))
   x = LinRange(1000, 2000, 2000)

   # Serial Execution #
   function f_S(p)
       prob_ = remake(prob, p = p)
       sol = solve(prob_, BS3(), reltol = 1e-6, abstol = 1e-6, saveat = x)
       if sol.retcode == ReturnCode.Success
           [maximum(sol[5,:]); maximum(sol[16,:])]
       else
           [NaN; NaN]
       end
   end

   const N = 60_000  # samples
   const D = 90    # parameters
   const M = 2    # outputs

   const lb =[279.0, 0.002142, 0.72, 090.0, 0.090, 126.0, 4.860, 2.7, 0.045, 0.0045, 0.002142, 086.83803, 0.0010179, 45.0, 090.0, 0.0264705885, 1.62, 279.0, 279.0, 7.48287, 1440.0, 27.0, 7.48287, 0.111375, 7.48287, 09.0, 279.0, 090.0, 086.83803, 0.00060699375, 0.116470584, -2200.0, 090.0, 0.000828, 279.0, 7.48287, 09.0, 09.0, 2.7, 18090.0, 090.0, 0.045, 0.063, 0.315, 1.62, 2.7, 78.75, 7.48287, 0.14868, 162.0, 0.0045, 0.00045, 090.0, 7.48287, 2.7, 13.5, 0.081, 2.7, 0.0, 1.8, 279.0, 90.0, 0.063, 0.2475, 126.0, 2.7, 086.83803, 090.0, 126.0, 086.83803, 4.86, 0.00045, 086.83803, 0900.0, 45000.0, 126.0, 7.02, 4.86, 090.0, 1.242, 1.62, 090.0, 086.83803, 1.35, 086.83803, 090.0, 090.0, 086.83803, 0.72, 0.539404866]
   const ub =[341.0, 0.002618, 0.88, 110.0, 0.110, 154.0, 5.941, 3.3, 0.055, 0.0055, 0.002618, 106.13537, 0.0012441, 55.0, 110.0, 0.0323529415, 1.98, 341.0, 341.0, 9.14573, 1760.0, 33.0, 9.14573, 0.136125, 9.14573, 11.0, 341.0, 110.0, 106.13537, 0.00074188125, 0.142352936, -1800.0, 110.0, 0.001012, 341.0, 9.14573, 11.0, 11.0, 3.3, 22110.0, 110.0, 0.055, 0.077, 0.385, 1.98, 3.3, 96.25, 9.14573, 0.18172, 198.0, 0.0055, 0.00055, 110.0, 9.14573, 3.3, 16.5, 0.099, 3.3, 0.0, 2.2, 341.0, 110., 0.077, 0.3025, 154.0, 3.3, 106.13537, 110.0, 154.0, 106.13537, 5.94, 0.00055, 106.13537, 1100.0, 55000.0, 154.0, 8.58, 5.94, 110.0, 1.518, 1.98, 110.0, 106.13537, 1.65, 106.13537, 110.0, 110.0, 106.13537, 0.88, 0.659272614]
   const sampler = SobolSample()
   const A, B = QuasiMonteCarlo.generate_design_matrices(N, lb, ub, sampler)  
end

# Initialize results matrix (SharedArrays are only shared among processes on the same node)
y_AB = SharedArray{Float64}(M, N, D)

# AB Matrix evaluation
@time @sync @distributed for i in 1:N
   for j in 1:D
       AB = copy(A[:,i])
       AB[j] = B[j,i]
       try
           y_AB[:,i,j] = f_S(AB)
       catch
           y_AB[:,i,j] = [NaN;NaN]
       end
   end
end

# Save results with error handling
save("y_AB.jld", "data", convert(Array, y_AB))

# Cleanup
rmprocs(workers())

The CellML file is from here. Interestingly, I do see significant speed-ups if I modify my first example accordingly.

In this form you get the overhead of Distributed on every call. You should instead batch it: i.e. an outer for i_1 in 1:256 over the workers and an inner loop over N/256 samples each, where you map i_1 to a specific subset of the indices. That would greatly reduce the overhead. See the sketch below.
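A rough sketch of that batching with pmap (the chunked evaluator eval_chunk is illustrative; it assumes f_S, A, B, N, D, and M are defined @everywhere as in your script):

@everywhere function eval_chunk(is)
    # One remote call evaluates a whole chunk of sample indices
    out = Array{Float64}(undef, M, length(is), D)
    for (k, i) in enumerate(is)
        for j in 1:D
            AB = copy(A[:, i])
            AB[j] = B[j, i]
            out[:, k, j] = f_S(AB)
        end
    end
    out
end

# A few chunks per worker keeps load balancing while amortizing the overhead
chunks = collect(Iterators.partition(1:N, cld(N, 4 * nworkers())))
results = pmap(eval_chunk, chunks)
y_AB = cat(results...; dims = 2)  # pmap preserves order, so this reassembles the sample axis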

You can also reduce the overhead by using 4 procs with 64 threads each, and then you need to set up the batching there, as in the sketch below.
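A sketch of that hybrid shape (plain addprocs shown for illustration; one worker per node would come from ClusterManagers on Slurm, and the same @everywhere assumptions as the previous sketch apply):

using Distributed
addprocs(4; exeflags = "--threads=64")  # 4 worker processes, 64 Julia threads each

@everywhere function eval_chunk_threaded(is)
    out = Array{Float64}(undef, M, length(is), D)
    Threads.@threads for k in eachindex(is)  # threads split the chunk within a process
        i = is[k]
        for j in 1:D
            AB = copy(A[:, i])
            AB[j] = B[j, i]
            out[:, k, j] = f_S(AB)
        end
    end
    out
end

chunks = collect(Iterators.partition(1:N, cld(N, nworkers())))
y_AB = cat(pmap(eval_chunk_threaded, chunks)...; dims = 2)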
