MPI.jl - Different behaviors using MPIManager and mpirun

I’m seeing some interesting behavior using MPIManager with a parallel optimization solver, and I’m trying to work out whether MPIManager sets things up differently than simply calling MPI.Init() and launching with my system mpirun.

When I use MPIManager, my solver works and scales as I would expect with more CPUs (this is great!). When I use mpirun to execute my script, the solver stalls immediately and eventually fails.

Example with MPIManager:

using MPI, Distributed

manager = MPIManager(np = 2)
addprocs(manager)   # launch the 2 MPI worker processes

my_model = create_my_model()

# Uses a RemoteChannel and @spawnat to send my_model to each rank listed in
# manager.mpi2j; it is bound to the name my_model on every rank.
send_model_to_mpi_ranks(my_model)

MPI.@mpi_do manager solve(my_model)   # solve the model with my MPI-based solver

Then I run the script:

$ julia script1.jl  # solver succeeds! Solver recognizes 2 ranks.
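
For reference, a minimal sketch of the idea behind send_model_to_mpi_ranks, as described in the comment above (a simplified illustration, not my exact code; it assumes manager.mpi2j maps MPI rank to Julia pid):

using Distributed

# Sketch only: ship the model to every MPI-rank worker via a RemoteChannel
# and bind it to a global my_model in Main on each worker.
function send_model_to_mpi_ranks(model)
    pids = collect(values(manager.mpi2j))            # Julia pids of the MPI ranks
    ch = RemoteChannel(() -> Channel{Any}(length(pids)))
    foreach(_ -> put!(ch, model), pids)              # one copy of the model per worker
    @sync for pid in pids
        @spawnat pid (global my_model = take!(ch))   # define my_model on that worker
    end
end

After this, my_model exists as a global on each MPI-rank worker, which is what MPI.@mpi_do manager solve(my_model) relies on.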

Example without MPIManager:

using MPI

MPI.Init()

my_model = create_my_model()
solve(my_model)   # same MPI-based solver call; this is where it stalls under mpirun

Now I run the script with mpirun:

$ mpirun -np 2 <path to julia> script2.jl  # solver fails. It also recognizes the 2 ranks.
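
(By "recognizes the 2 ranks" I mean the usual rank/size check; something along these lines, shown only for illustration, reports the expected ranks:)

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
println("rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
MPI.Finalize()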

Both cases work with a single rank (np = 1), so I think it has to be something to do with MPI. Is there something fundamentally different about how MPIManager sets up the workers? I looked through its source code, but nothing in particular jumps out at me.

I’m using Julia 1.0.3 and MPICH v3.2. which mpirun, which mpicc, and the Julia build script all point to the same locations.