I’m on a cluster where MPI transport is significantly faster than TCP transport, and I’d like to keep my interactive Jupyter-based workflow for parallel work. A key point is that I don’t want to do any intercommunication “by hand”; I just want to use the normal Julia Distributed constructs like @everywhere, pmap, etc., and have those transfer objects for me (via MPI transport).
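To be concrete, this is the kind of code I’d like to keep writing unchanged, just with the object transfer happening over MPI instead of TCP (f here is just a stand-in for real work):

using Distributed
addprocs(4)                              # however the workers end up being launched

@everywhere f(x) = sum(abs2, randn(x))   # code broadcast to every worker
pmap(f, fill(10^6, 20))                  # inputs/results shipped around for me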
MPIClusterManagers.jl has MPIManager, which can set up something like:
1)             MPI
            __________
          / | worker |
kernel -    | worker |
          \ | worker |
            ----------
but the kernel is not part of the MPI pool so distributing work from the kernel happens via slow TCP.
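For reference, (1) is roughly what you get from the usual MPIManager usage (as in the MPIClusterManagers README), where the MPI processes are pure workers and the process calling addprocs talks to them over TCP:

using MPIClusterManagers, Distributed

# launches 4 MPI worker processes (via mpiexec); MPI is available among the
# workers, but kernel <-> worker traffic still goes over TCP
manager = MPIManager(np = 4)
addprocs(manager)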
Something better would be the following (I believe dask-mpi and ipyparallel have something like this):
2)                  MPI
         _______________________
         |            / worker |
kernel - | controller - worker |
         |            \ worker |
         -----------------------
where there’s one slow TCP send to the controller, but then the data is scattered via fast MPI from the controller to the workers. Is there anything like this in Julia?
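(For concreteness, here’s roughly the data flow I mean, hand-rolled on top of setup (1). This is exactly the by-hand intercommunication I’d like a package to hide from me; it also assumes the workers share MPI.COMM_WORLD with the first Julia worker at MPI rank 0, which is worth double-checking, and the exact MPI.bcast signature varies between MPI.jl versions.)

using Distributed
@everywhere using MPI

data = rand(10^6)

# one slow TCP transfer: kernel -> controller (assumed to be MPI rank 0)
remotecall_wait(d -> (global chunk = d), first(workers()), data)

# fast MPI broadcast among the workers: controller -> everyone else
@everywhere workers() begin
    comm = MPI.COMM_WORLD
    obj  = MPI.Comm_rank(comm) == 0 ? Main.chunk : nothing
    global chunk = MPI.bcast(obj, 0, comm)   # or MPI.bcast(obj, comm; root=0) on newer MPI.jl
end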
Finally, the most ideal thing (but maybe the hardest) seems to me to be having the Jupyter kernel itself just be part of the MPI pool, like:
3)          MPI
   _____________________
   |         / worker  |
   |  kernel - worker  |
   |         \ worker  |
   ---------------------
I actually hacked together something like this, which works by making a custom kernel.json file, but it’s pretty brittle, hangs when the kernel is shut down or restarted, and isn’t super usable. Is anyone aware of something like this that’s been done better?
Thanks.
Unfortunately, there isn’t a great way to do this, though option 3 sounds like the simplest way forward. This does sound like a good item for the JuliaHPC call, if you would like to join.
Thanks for the tip, will try to join!
Option 3 should also be doable at NERSC: we allow Jupyter jobs where the kernel runs in an srun job. I can talk with some of the folks who have set it up.
There is a function in the IJulia.jl build script for adding custom Jupyter kernels:
https://github.com/JuliaLang/IJulia.jl/blob/master/deps/kspec.jl
It might be possible to use or adapt that.
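For instance, installkernel (defined in that file and exported by IJulia) already handles a custom display name plus extra julia command-line flags; something like the srun wrapper and custom entry script shown below would presumably still have to be patched into the generated kernel.json by hand:

using IJulia

# writes a new kernel spec (display name, argv, env) into the Jupyter
# kernels directory; extra string arguments become julia flags
installkernel("Julia (MPI)", "--project=@.")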
Thanks (sorry I couldn’t join today in the end). For definiteness, here’s my suboptimal solution for (3):
# mpi_kernel.jl
using Distributed, MPI, MPIClusterManagers

MPI.Init()

# Ranks > 0 stay in the worker event loop inside this call; rank 0 gets the
# manager back and continues on to run the normal IJulia kernel below.
manager = MPIClusterManagers.start_main_loop(
    MPI_TRANSPORT_ALL,
    stdout_to_master = true,
    stderr_to_master = true
)

# Rank 0 becomes the actual Jupyter kernel by running IJulia's usual entry point.
include("/global/homes/m/marius/.julia/packages/IJulia/e8kqU/src/kernel.jl")
and
# kernel.json
{
  "display_name": "Julia 1.6.1 (MPI)",
  "argv": [
    "srun",
    "/global/u1/m/marius/src/julia-1.6.1/bin/julia",
    "-i",
    "--color=yes",
    "--project=@.",
    "/global/u1/m/marius/.local/share/jupyter/kernels/julia-1.6-mpi/mpi_kernel.jl",
    "{connection_file}"
  ],
  "language": "julia",
  "env": {},
  "interrupt_mode": "signal"
}
So basically the kernel srun’s the mpi_kernel.jl file, rank 0 becomes the actual kernel by calling the IJulia file that would have originally been called, and the other ranks connect back as workers. It works, but the downsides are:
- It’s a different kernel, so it gets saved into the notebook file, which means that if you open the same notebook outside of an MPI environment it’ll basically crash.
- There’s no way to control how many workers you get; you just get the entire allocation that the job was submitted with. It would be nicer if you could somehow delay spawning the workers until you’re actually in the notebook and could choose how many you want.
- It doesn’t shut down cleanly, so to restart it you have to resubmit your entire batch job (or maybe do some killing, I haven’t figured out exactly how; this is probably the most annoying issue). One idea for mitigating this is sketched below.
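One thing I haven’t tried yet for the shutdown problem: MPIClusterManagers also has stop_main_loop, so registering it as an exit hook in mpi_kernel.jl might let the worker ranks tear down when the kernel process exits (untested):

# in mpi_kernel.jl, right after start_main_loop returns on rank 0
atexit() do
    try
        MPIClusterManagers.stop_main_loop(manager)   # ask workers to exit their loop
    catch
        # ignore errors during teardown
    end
end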
If these were solved, I definitely think you could use that IJulia code to automate making such kernels.
Are you running the Jupyter session from within an salloc? Or does the srun do the allocation?
From within (via a JupyterHub instance managed by them).
I did a little reading the other day, and I think one way to fix my 2nd issue above is to initially spawn the kernel with just one MPI process, then provide a command for spawning additional workers which would call MPI_Comm_spawn, and then use MPI_Intercomm_merge to merge the newly created comm back into the main global one which (presumably) is being used by Julia to serialize objects to workers. I may have missed some hitch in this plan, though.
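At the MPI level I think the calls would look roughly like the sketch below, using MPI.jl’s Comm_spawn and Intercomm_merge wrappers (the worker script name is made up, and the part I haven’t worked out is getting MPIClusterManagers to adopt the merged communicator for its transport):

using MPI
MPI.Init()

n = 4   # number of extra workers to spawn on demand

# spawn n new MPI processes running a (hypothetical) worker entry script
intercomm = MPI.Comm_spawn("julia", ["mpi_worker.jl"], n, MPI.COMM_WORLD)

# merge the parent/children intercommunicator into a single intracommunicator,
# which would then have to replace the comm used for Julia's object transport
newcomm = MPI.Intercomm_merge(intercomm, false)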
I think the best solution in this case would look similar to LLNL’s bridge kernel, which they explain a bit in this paper (it may even be possible to reuse some of those parts).