Trouble with DifferentialEquations.jl using MPI with the integrator interface

I’m using DIfferentialEquations.jl to solve a very large (10^9) system of coupled ODEs. I’m using MPI because the memory required for the full system won’t find on a workstation. The problem fits pretty naturally with the integrator interface and everything works well until the problem starts to get stiffer.

In general the MPI workers don’t automatically share the same dt so they tend to get out of step with each other. This can be handled by manually setting the dt outside of the step!() function. A problem seems to come up when some but not all of the MPI workers take a successful step. When this happens the workers that failed the success check circle back to take a smaller step while the workers that think the step was successful move on with the code. Once the workers are out of step with each other the code just hangs forever.

I’ve put together a fairly minimal working example to demonstrate the problem. The problem is two coupled oscillators with one MPI process for each oscillator. I wanted to make sure that the test problem was small enough that it would work on any computer with at least two cores. The serial version runs for 1000 steps with no problem. The MPI version goes about 700 steps and then hangs. From the terminal output I can see one worker reporting that it is done with the step!() function while the other is indicating that it is calling the mpi_coupled_oscillators!() function again which suggests that it is trying again instead of exiting the step!() function.

So I think the problem is that the MPI workers and tracking error/time steps/success independently but I’m not sure where the best place to try to change that is. Any and all ideas are greatly appreciated!

using DifferentialEquations
p = (1,1e8,1e-1)  # (spring constant for oscillator 1, spring constant for oscillator 2, coupling term)

# test system done in serial to make sure the problem is solvable using a given ODE solver.
function coupled_oscillators!(du,u,p,t)
 du[1] = u[2]
 du[2] = -(p[1] + p[3]) * u[1] + p[3]*u[3]
 du[3] = u[4]
 du[4] = -(p[2] + p[3]) * u[3] + p[3]*u[1]
end

# set up the problem
u0 = [1.0;  0.0; -0.5; -0.5]
tspan = (0.0,100.0)
prob = ODEProblem(coupled_oscillators!,u0,tspan,p)
# define single step integrator
integrator = init(prob,Tsit5(), save_everystep = false, abstol = 1e-8, reltol = 1e-8)

# record that results as a text file since MPI precludes running interactively as far as I can tell.
# creates a text file with 5 columns (time, x1, v1, x2, v2)
io_all = open("all.txt","w")

# run the problem for some 1000 steps and record values after every step.
for i = 1:1000
	step!(integrator) #take one step the jump out to record data
    println(io_all,  string(integrator.t)*" "*
        string(integrator.u[1])*" "* 
        string(integrator.u[2])*" "* 
        string(integrator.u[3])*" "*
        string(integrator.u[4])
        )
end

####### mpi version

using MPI
MPI.Init()
using DifferentialEquations

const comm = MPI.COMM_WORLD
const mpi_size = MPI.Comm_size(comm)
const mpi_rank = MPI.Comm_rank(comm)

function mpi_coupled_oscillators!(du,u,p,t)
    println("worker "*string(mpi_rank)*" dt is "* string(integrator.dt))#Tells you when each worker is inside the oscillator function and what dt is being used.
    # have the two workers exchange information
    if mpi_rank == 0
        MPI.Isend(u, 1, 1, comm)
    end
    
    if mpi_rank == 1
        MPI.Isend(u, 0, 2, comm)
    end
    
    MPI.Barrier(comm)
    u_other = similar(u) #allocate space for input from the other worker

    # get the input from the other worker
    if mpi_rank == 0
        MPI.Irecv!(u_other, 1, 2, comm)
    end
    
    if mpi_rank == 1
        MPI.Irecv!(u_other, 0, 1, comm)
    end

    # update the du terms. MPI version of serial problem setup
    if mpi_rank == 0
        du[1] = u[2]
        du[2] = -(p[1] + p[3]) * u[1] + p[3]*u_other[1]
    end

    if mpi_rank == 1
        du[1] = u[2]
        du[2] = -(p[2] + p[3]) * u[1] + p[3]*u_other[1]
   end
end

# set up the problem with the same initial conditions as the serial version
if mpi_rank == 0
    mpi_u0 = [1.0; 0.0]
end

if mpi_rank == 1
    mpi_u0 = [-0.5; -0.5]
end

prob = ODEProblem(mpi_coupled_oscillators!,mpi_u0,tspan,p)
integrator = init(prob,Tsit5(), save_everystep = false, abstol = 1e-8, reltol = 1e-8) # parallel version of single step integrator
# for the mpi output files, each one has 3 columns (t1, x1, v1) and (t2, x2, v2)
io_out = open("part_"*string(mpi_rank)*".txt","w")
close(io_out)

# try to run for the same number of steps as the serial version.
for i = 1:1000
    #try to force the solver to take the smaller of the two proposed steps to keep the processors aligned with eachother
    integrator.dt = MPI.Allreduce!([integrator.dt], min, comm)[1] 
    integrator.dtpropose = MPI.Allreduce!([integrator.dtpropose], min, comm)[1]

	step!(integrator) #take one step then jump out to record data
    println("Step "*string(i)*" worker "*string(mpi_rank)*" ready  ") #processors report when they have successfully taken a step and are ready to move on
    io_out = open("part_"*string(mpi_rank)*".txt","a")
    println(io_out,  string(integrator.t)*" "*
        string(integrator.u[1])*" "* 
        string(integrator.u[2])
    )
    close(io_out)
end

You almost certainly want to be doing this with a MPI-based array type and one integrator

1 Like

The real version is using PencilArrays.jl which are MPI arrays designed to work with FFTW and I see similar behavior. I was just going for an easy test problem for illustration. Your point about the multiple integrators I think probably is the key to the problem, although I can’t quite get my head around how to make it work in an MPI environment.

I can set he pencil arrays up for the problem, but then when I declare the integrator I face a dilemma: The kind of dumb way I did it results in every worker having their own integrator and fails whenever one of them fails an error check, but if only one worker creates the integrator then the code fails immediately.

if mpi_rank == 0
       integrator = ...
       step!(integrator)
end

fails because the other workers never see the call to the function so the boss node waits forever at the start of the rhs function.

if mpi_rank == 0
      integrator = ...
end
step!(integrator)

will fail because all of the other workers see the integrator as an undefined variable. Hopefully I’m missing something obvious and someone can point it out. If not, I’ll keep playing with it and see where I get.

Does PencilArrays.jl define an appropriate broadcast overload? And reductions?

1 Like

Broadcasts are properly defined as far as I can tell, but there is no special treatment for reductions. This likely means that reductions are performed locally on each process, without communication.

It should be easy, and I think it would make sense, to overload reductions in PencilArrays.jl so that they are global, using something like MPI.Allreduce behind the scenes. I guess this would fix @reidr’s issue.

Besides, I think the original approach of one integrator per MPI process would make more sense in MPI-style programming, as one usually wants to minimise communications (especially when working with thousands of processes). In particular, communications in PencilArrays are explicit as much as possible.

It would still be explicit, it’s just using the array interface for explicit communication in an abstract generic way. The other way to do it would be to either use local adaptive time stepping to tstops to a fixed dt, somewhat like parallel in time methods, or fixed time step on each piece and then reduce some norm to update a global time step.

PencilArrays.jl v0.11.0 is now being registered. In this version, Julia’s reduction operations such as minimum or maximum (actually, anything that uses mapreduce) are now computed globally (see PR).

Assuming that DifferentialEquations.jl calls something like maximum for step size control, this will hopefully fix the issue. @reidr let me know if it works after updating PencilArrays.jl.

Hi @reidr

Take a took at GitHub - fverdugo/PartitionedArrays.jl: Vectors and sparse matrices partitioned into pieces for parallel distributed-memory computations. perhaps it is useful for your use case, specially if you need to distribute a sparse Jacobian matrix.

2 Likes

I don’t think local time stepping is workable in this case because the rhs function involves forward and backward FFTs so none of the information is truly local. I think you would need a lot of auxiliary arrays to store consistent information which would get really expensive in memory and communication.

1 Like

Sorry for the late reply, I wanted to check some corner cases before I posted and then was out of town for several weeks. As far as I can tell it is working great on my workstation. The hanging issue is gone and the system advances.

I’m having trouble migrating it to our local cluster, which is odd because it is using the same version MPI, just on several nodes instead of one like my workstation. It runs into trouble the first time I try to take in FFT. I’m not sure it’s a problem with PencilArrays since it is working fine on my workstation. I’ve copied the error in case you see anything obvious.

signal (15): Terminated
in expression starting at /scratch/me/GUPPE/mpi_guppe.jl:56
__GI_memset at /lib64/libc.so.6 (unknown line)
array_resize_buffer at /buildworker/worker/package_linux64/build/src/array.c:728
jl_array_grow_at_end at /buildworker/worker/package_linux64/build/src/array.c:911 [inlined]
jl_array_grow_end at /buildworker/worker/package_linux64/build/src/array.c:975
_growend! at ./array.jl:888 [inlined]
resize! at ./array.jl:1108 [inlined]
transpose_impl! at /home/me/.julia/packages/PencilArrays/5k2hi/src/Transpositions/Transpositions.jl:286
macro expansion at /home/me/.julia/packages/PencilArrays/5k2hi/src/Transpositions/Transpositions.jl:147 [inlined]
macro expansion at /home/me/.julia/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]
#transpose!#4 at /home/me/.julia/packages/PencilArrays/5k2hi/src/Transpositions/Transpositions.jl:146 [inlined]
transpose!##kw at /home/me/.julia/packages/PencilArrays/5k2hi/src/Transpositions/Transpositions.jl:145 [inlined]
_apply_plans_out_of_place! at /home/me/.julia/packages/PencilFFTs/QKlxm/src/operations.jl:187
unknown function (ip: 0x2b6e5a574d12)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:670
_apply_plans_out_of_place! at /home/me/.julia/packages/PencilFFTs/QKlxm/src/operations.jl:200
_apply_plans! at /home/me/.julia/packages/PencilFFTs/QKlxm/src/operations.jl:144 [inlined]
macro expansion at /home/me/.julia/packages/PencilFFTs/QKlxm/src/operations.jl:27 [inlined]
macro expansion at /home/me/.julia/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]
mul! at /home/me/.julia/packages/PencilFFTs/QKlxm/src/operations.jl:25 [inlined]
real2spec! at /scratch/me/GUPPE/utility_functions/forward_backward_transforms.jl:17
unknown function (ip: 0x2b6e5a55bf9c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:115
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:204
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:155 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:562
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:670
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:877
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:825
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:929
eval at ./boot.jl:360 [inlined]
include_string at ./loading.jl:1116
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
_include at ./loading.jl:1170
include at ./Base.jl:386
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
exec_options at ./client.jl:285
_start at ./client.jl:485
jfptr__start_43689.clone_1 at /home/me/julia-1.6.3/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
true_main at /buildworker/worker/package_linux64/build/src/jlapi.c:560
repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:702
main at /buildworker/worker/package_linux64/build/cli/loader_exe.c:51
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at /home/me/julia-1.6.3/bin/julia

I’m glad to know that it works locally!

signal (15): Terminated
in expression starting at /scratch/me/GUPPE/mpi_guppe.jl:56
__GI_memset at /lib64/libc.so.6 (unknown line)
array_resize_buffer at /buildworker/worker/package_linux64/build/src/array.c:728
jl_array_grow_at_end at /buildworker/worker/package_linux64/build/src/array.c:911 [inlined]
jl_array_grow_end at /buildworker/worker/package_linux64/build/src/array.c:975
_growend! at ./array.jl:888 [inlined]
resize! at ./array.jl:1108 [inlined]
transpose_impl! at /home/me/.julia/packages/PencilArrays/5k2hi/src/Transpositions/Transpositions.jl:286

That’s strange, the error occurs when PencilArrays tries to increase the size of an array via resize!. The array is an internal UInt8 vector used when transposing (rearranging) distributed data among the MPI processes. This kind of operation indeed happens when you take an FFT.

The error seems to be unrelated to PencilArrays, and I guess it could be reproduced by just calling something like resize!(UInt8[], 100) on your cluster. According to the error message, the problem is this memset call, which I guess happens when Julia sets to zero the newly-allocated storage of an UInt8 array (which satisfies the elsz == 1 condition in the linked C code).