Simplest way to convert a program for parallel (multithreaded) runs on multiple servers/cores?

CosmoProf · February 16, 2021, 12:08am

Hi again all, I am running a collection of programs doing intensive numerical calculations (e.g., ~1-2 days of runtime on a system with 64 GB of memory and CPU > 2.1 GHz).

I have an opportunity to save tremendous resources (and money) by adapting my programs to run on a locally available system with multiple servers & cores. But completely re-writing my pre-existing programs (over 5,000 lines of code) to adapt them to this parallel system would be a prohibitive amount of work, and seems impossible to do without introducing numerous errors.

(Also, I don’t know how to do it; something called “multi-threading” seems to be involved, and my only pointer so far is a link to Distributing Computing for Julia (https://docs.julialang.org/en/v1/stdlib/Distributed), which has so much information I don’t know where to start.)

I can’t be the first person with this problem… is there some straightforward way to modify a program minimally to get it to run on a parallel system like this? (And if so, could someone share some simple coding examples on how to do it?)

Thanks for any info…!

lmiq · February 16, 2021, 1:18am

Hard to tell without any information about your application. But the simplest parallelization model is this: Multi-Threading · The Julia Language

But the first homework is to find out what exactly in your program takes most of the time, to think how that must be parallelized (or accelerated sequentially as well, many times a code can be accelerated hundreds of times before thinking about parallelization).

CosmoProf · February 16, 2021, 1:50am

Hi Leandro, thanks for the reply.

The Multi-Threading link looks interesting, but appears like I might have to go through all of the programs and modify every single loop structure. (!?)

Anyway, the vast majority of my program run-time is spent doing one type of task: solving a PDE using the method-of-lines. In other words, solving a very large (> 1000) set of coupled ODE’s. This, for example, is a set of commands that could take over 24 hours:

prob = ODEProblem(CoupledEqns!, InitConditionsArray, tspan, ParamsTuple)

@time SolvTheProb = DifferentialEquations.solve(prob, Rodas5(), dense=true, 
                    reltol=1e-12, abstol=1e-12, maxiters = 1e7, progress=true)

For example, InitConditionsArray may be an array of 3201 elements, leading to a “sparsely coupled” system of differential equations, where each Nth equation depends also upon the values before it (N-1) and the one after it (N+1).

lmiq · February 16, 2021, 1:58am

Paralellizing coupled ODEs isn’t the easiest thing. But if you provide a minimal working example I am sure you will get good advice here.

(Without knowing anything about the problem, 1e-12 seems a too strict tolerance, given the precision of floating point arithmetics.).

rveltz · February 16, 2021, 7:39am

An ODE of size 1000 is not big. You said that the jacobian is sparse, did you try giving the jacobian sparsity pattern, or provide an analytical jacobian?

Have your read the ODE tips ?

CosmoProf · February 17, 2021, 3:35am

Hi Leandro, a copy of a simplified version is below. As written, the program takes ~3-4 minutes to run with Npix = 201. But the runtime (and memory used) goes up at least as fast as Npix^2, and the program takes days to run with Npix = 3201 (for example). Almost all of the runtime is taken by the final command below, DifferentialEquations.solve:

# Minimal working example to test Multi-threading...
#
using DifferentialEquations
# using OrdinaryDiffEq, DiffEqBase, DiffEqOperators, BandedMatrices
#
Npix = 201
tspan = [1.0,5.0]
ODEparams = tuple(0.04,1.1)
xArr = [(-4.0 + (0.04*(i-1))) for i in 1:Npix]
#
InitCondsArr = Array{Float64}(undef, Npix, 2)
InitCondsArr[1:Npix,2] .= sin.(xArr)
InitCondsArr[1:Npix,1] .= cos.(xArr)
#
function EqnsToSolve!(du,u,p,t)
    Npts = size(u)[1]
    DX = p[1]
    k = p[2]
# Deal with left-most x-value:
    du[1,2] = u[1,1]
    du[1,1] = (-k*u[1,2]) +
              (((-1*u[1,2])+(1*u[(1+1),2]))/(DX^2))
# Deal with right-most x-value:
    du[Npts,2] = u[Npts,1]
    du[Npts,1] = ((-k*u[Npts,2])/2) +
                    (-1)*((((-3/2)*u[Npts,1])+((2)*u[(Npts-1),1]) +
                    ((-1/2)*u[(Npts-2),1]))/(-DX))
# All other values in the x-grid:
    for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) +
                    ((u[(i+1),2]-(2*u[i,2])+u[(i-1),2])/(DX^2))
    end
end
#
ProbToSolve = ODEProblem(EqnsToSolve!, InitCondsArr, tspan, ODEparams)
#
@time SolveTheProb = DifferentialEquations.solve(ProbToSolve, Rodas5(),
    dense=true, reltol=1e-12, abstol=1e-12, maxiters = 1e7, progress=true)

rveltz: I looked through the page you linked me to, but it is extremely complicated. I read about the Jacobian procedures, but my program explicitly specifies how the equations for each value of x are linked together (instead of doing some kind of banded matrices approach), so I’m not sure how those procedures apply here. (Also, those examples had constant coefficients for the differential equations here – like my constants k and DX. But in general, I may have non-constant coefficients that depend on x and t.)

I’m hoping to learn how to “multi-thread” this for a computer system with multiple cores, threads, servers. (Or otherwise rapidly speed up the “solve” step.)
Any detailed suggestions welcome!

rveltz · February 17, 2021, 8:36am

First, you are basically solving a linear Laplace equation.

You could look at ModelingToolkit.jl and https://github.com/SciML/ModelingToolkit.jl/blob/master/test/pde.jl

That’s the point… Perhaps you read too fast Automatic-Sparsity-Detection

I may have non-constant coefficients that depend on x and t .)

How would this prevent from computing a differential?

You can try CVODE_BDF as a solver, it should quite fast. It will estimate the jacobian for you.

CosmoProf · February 18, 2021, 6:30am

rveltz, I took a closer look at your Automatic-Sparsity-Detection link, and attempting to use “Automatic Derivation of Jacobian Functions”, I got an expression too large error for any Npix greater than around 280.

For example, I modified my sample of working code above, by adding the following:

ProbToSolve = ODEProblem(EqnsToSolve!, InitCondsArr, tspan, ODEparams)
AutoJac = modelingtoolkitize(ProbToSolve)
jac_Auto = eval(ModelingToolkit.generate_jacobian(AutoJac)[2])
#
#      Above line gives the error: "syntax: expression too large"
#
f_Auto = ODEFunction(EqnsToSolve!, jac=jac_Auto)
prob_jac_Auto = ODEProblem(f_Auto, InitCondsArr, tspan, ODEparams)
@time sol_jac_Auto = solve(prob_jac_Auto)

I’m hoping to use values of Npix up to at least 3200, so it looks like this doesn’t work!

CosmoProf · February 18, 2021, 6:43am

Also, I am not able to the CVODE_BDF solver, since by using the commands:

using Sundials
ProbToSolve = ODEProblem(EqnsToSolve!, InitCondsArr, tspan, ODEparams)
@time SolveTheProb = DifferentialEquations.solve(ProbToSolve, CVODE_BDF(),
    dense=true, reltol=1e-12, abstol=1e-12, maxiters = 1e7, progress=true)

I get the error:

MethodError: no method matching Sundials.DEOptions(::DataStructures.BinaryHeap{
Float64,Base.Order.ForwardOrdering},::DataStructures.BinaryHeap{Float64,
Base.Order.ForwardOrdering}, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, 
::CallbackSet{Tuple{},Tuple{}}, ::Float64, ::Float64, ::Bool, ::Bool, ::Bool, 
::Bool, ::String, ::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), ::Float64)

So I’m 0 for 2 so far.

rveltz · February 18, 2021, 7:21am

Strange for CVODE_BDF. Maybe @ChrisRackauckas has an idea

ChrisRackauckas · February 18, 2021, 12:57pm

Are you not using Float64s? It’s a C++ program so it’s a bit more restrictive.

CosmoProf · February 18, 2021, 1:53pm

I think I’m using Float64… A copy of the whole test program is below. It fails both on JuliaPro 1.4.2-1, and JuliaPro 1.5.3-1. On two different computer systems. Easing the tolerances to reltol=1e-6, abstol=1e-6 did not help. (BTW I have never been able to successfully use CVODE_BDF or lsoda.)

using DifferentialEquations
using OrdinaryDiffEq, DiffEqBase, DiffEqOperators, BandedMatrices
using ModelingToolkit
using Sundials # for CVODE_BDF as a solver!
#
Npix = 81
tspan = [1.0,5.0]
ODEparams = tuple(0.04,1.1)
xArr = [(-4.0 + (0.04*(i-1))) for i in 1:Npix]
#
InitCondsArr = Array{Float64}(undef, Npix, 2)
InitCondsArr[1:Npix,2] .= sin.(xArr)
InitCondsArr[1:Npix,1] .= cos.(xArr)
#
function EqnsToSolve!(du,u,p,t)
    Npts = size(u)[1]
    DX = p[1]
    k = p[2]
# Deal with left-most x-value:
    du[1,2] = u[1,1]
    du[1,1] = (-k*u[1,2]) +
              (((-1*u[1,2])+(1*u[(1+1),2]))/(DX^2))
# Deal with right-most x-value:
    du[Npts,2] = u[Npts,1]
    du[Npts,1] = ((-k*u[Npts,2])/2) +
                    (-1)*((((-3/2)*u[Npts,1])+((2)*u[(Npts-1),1]) +
                    ((-1/2)*u[(Npts-2),1]))/(-DX))
# All other values in the x-grid:
    for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) +
                    ((u[(i+1),2]-(2*u[i,2])+u[(i-1),2])/(DX^2))
    end
end
#
ProbToSolve = ODEProblem(EqnsToSolve!, InitCondsArr, tspan, ODEparams)
#
@time SolveTheProb = DifferentialEquations.solve(ProbToSolve, CVODE_BDF(),
    dense=true, reltol=1e-12, abstol=1e-12, maxiters = 1e7, progress=true)
#

Thanks for any suggestions…

ChrisRackauckas · February 18, 2021, 1:58pm

Try not JuliaPro?

CosmoProf · February 18, 2021, 2:00pm

(?)
I thought I was using JuliaPro.

lmiq · February 18, 2021, 2:30pm

I have no idea what the problem is, but I confirm that the code above returns an error in 1.5.3 (no JuliaPro, no IDE):

julia> include("./cosmoprof.jl")
ERROR: MethodError: no method matching Sundials.DEOptions(::DataStructures.BinaryHeap{Float64,DataStructures.LessThan}, ::DataStructures.BinaryHeap{Float64,DataStructures.LessThan}, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::CallbackSet{Tuple{},Tuple{}}, ::Float64, ::Float64, ::Bool, ::Bool, ::Bool, ::Bool, ::String, ::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), ::Float64)
Closest candidates are:
  Sundials.DEOptions(::SType, ::TstopType, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::CType, ::abstolType, ::reltolType, ::Bool, ::Bool, ::Bool, ::Bool, ::String, ::F5, ::Int64) where {SType, TstopType, CType, reltolType, abstolType, F5} at /home/leandro/.julia/packages/Sundials/2KP0s/src/common_interface/integrator_types.jl:2
__init(::ODEProblem{Array{Float64,2},Tuple{Float64,Float64},true,Tuple{Float64,Float64},ODEFunction{true,typeof(EqnsToSolve!),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::CVODE_BDF{:Newton,:Dense,Nothing,Nothing}, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}; verbose::Bool, callback::Nothing, abstol::Float64, reltol::Float64, saveat::Array{Float64,1}, tstops::Array{Float64,1}, maxiters::Float64, dt::Nothing, dtmin::Float64, dtmax::Float64, timeseries_errors::Bool, dense_errors::Bool, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Bool, dense::Bool, progress::Bool, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), save_timeseries::Nothing, advance_to_tstop::Bool, stop_at_next_tstop::Bool, userdata::Nothing, alias_u0::Bool, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leandro/.julia/packages/Sundials/2KP0s/src/common_interface/solve.jl:309
(::DiffEqBase.var"#__init##kw")(::NamedTuple{(:dense, :reltol, :abstol, :maxiters, :progress),Tuple{Bool,Float64,Float64,Float64,Bool}}, ::typeof(DiffEqBase.__init), ::ODEProblem{Array{Float64,2},Tuple{Float64,Float64},true,Tuple{Float64,Float64},ODEFunction{true,typeof(EqnsToSolve!),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::CVODE_BDF{:Newton,:Dense,Nothing,Nothing}, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}) at /home/leandro/.julia/packages/Sundials/2KP0s/src/common_interface/solve.jl:42

ChrisRackauckas · February 18, 2021, 2:32pm

The error says maxiters isn’t an Int. Did you try Int(1e7)?

CosmoProf · February 18, 2021, 3:35pm

Just tried maxiters = Int(1e7), it worked! (For now.) I’ll test running the real (very long) program with the CVODE_BDF solver now. Thanks!

CosmoProf · February 18, 2021, 3:40pm

BTW, does using CVODE_BDF mean that I don’t have to bother with implementing Automatic Derivation of Jacobian Functions – as I described above, where I got an error: "syntax: expression too large" for any Npix > 280? Or would somehow fixing that provide even more speedup & memory savings?

I’m referring to the post above where I tried to implement:

ProbToSolve = ODEProblem(EqnsToSolve!, InitCondsArr, tspan, ODEparams)
AutoJac = modelingtoolkitize(ProbToSolve)
jac_Auto = eval(ModelingToolkit.generate_jacobian(AutoJac)[2])

ChrisRackauckas · February 18, 2021, 3:47pm

Julia cannot compile huge functions. You’ll want to make it sparse in that case.

CosmoProf · February 18, 2021, 4:00pm

Ok, so that’s a limitation we can’t get around…

Regarding sparse Jacobians, I read through the Automatic Sparsity Detection section of “Solving Large Stiff Equations · DifferentialEquations.jl” as much as I could, but I got lost when it started talking about brusselators, and getting arrays of random numbers (for some reason) for arrays called input and output. Any tips on how to apply Sparsity to my example program here, I’d be very grateful.

Topic		Replies	Views
Looking to parallelize (not parametrize) the solution to a large and highly stiff ODE system Julia at Scale multithreading , ode , differentialequation , sundials , parallel-computing	7	263	July 13, 2024
Parellelization for large ODEs, Multithreading fails Performance performance , multithreading , differentialequation	16	326	March 30, 2023
Tutorials or examples on multithreading and distributed processing for numerical PDEs New to Julia	8	1231	June 2, 2021
Specifying ode solver options to speed up compute time New to Julia question	42	1943	June 2, 2021
Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac Performance performance , parallel , differentialequation	41	3721	February 2, 2022

Simplest way to convert a program for parallel (multithreaded) runs on multiple servers/cores?

Related topics