MPI.jl - Different behaviors using MPIManager and mpirun


#1

I’m getting some interesting behavior using MPIManager with a parallel optimization solver, and I’m trying to rule out whether MPIManager is doing something different from just calling MPI.Init() and running with my system mpirun.

When I use the MPIManager, my solver works and scales as I would expect with more CPUs (this is great!). When I use mpirun and execute my script, the solver stalls immediately and eventually fails.

Example with MPIManager:
script1.jl

using MPI
using Distributed  # for addprocs, RemoteChannel, and @spawnat on Julia 1.0
include("my_model_stuff.jl")

manager = MPIManager(np = 2)
addprocs(manager)

my_model = create_my_model()
send_model_to_mpi_ranks(my_model)  # uses a RemoteChannel and @spawnat to send my_model to each worker in manager.mpi2j; each rank ends up with its own global my_model (see the sketch after this example)
MPI.@mpi_do manager solve(my_model)  #solves the model with my MPI solver. 

Then I run the script

$ julia script1.jl  # solver succeeds! Solver recognizes 2 ranks.
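
For context, send_model_to_mpi_ranks does roughly the following (a simplified sketch, not the exact code from my_model_stuff.jl; the real version has more error handling):

function send_model_to_mpi_ranks(model)
    for (rank, pid) in manager.mpi2j        # mpi2j maps MPI rank => Julia worker pid
        ch = RemoteChannel(() -> Channel{Any}(1), pid)
        put!(ch, model)
        # bind the received object to a global my_model on that worker
        wait(@spawnat pid begin
            global my_model = take!(ch)
        end)
    end
end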

Example without MPIManager:
script2.jl

using MPI
include("my_model_stuff.jl")

MPI.Init()
my_model = create_my_model()
solve(my_model) 
MPI.Finalize()

Now run the script with mpirun

$ mpirun -np 2 <path to julia> script2.jl  #solver fails.  It also recognizes the 2 ranks.

Both cases work with a single rank (np = 1), so I think it has to be something to do with MPI. Is there something fundamentally different about how MPIManager sets up the workers? I looked through its source code, but nothing in particular jumps out at me.
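
As a sanity check, a bare-bones script like the one below (just a generic rank/size check, not part of my model code) can be run under mpirun to confirm what the solver sees; under MPIManager the equivalent is to run just the println line inside MPI.@mpi_do manager, since MPI is already initialized on those workers.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
# report rank, world size, and host so the two launch methods can be compared
println("rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm)) on $(gethostname())")
MPI.Barrier(comm)
MPI.Finalize()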

I’m using Julia 1.0.3 and mpich2 v3.2. which mpirun, which mpicc, and the Julia build script all point to the same locations.