I would like to hide the MPI calls needed to wait for a non-blocking Isend / Irecv! by using Julia's asynchronous programming capabilities. I can imagine 3 main ways of doing this (see the code below), but I am not certain which one is the way to go.
Which one is best? Are there other, better alternatives? Any help would be very welcome!
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
# Start a non-blocking exchange
N = 4
send_mesg = Array{Float64}(undef, N)
recv_mesg = Array{Float64}(undef, N)
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
fill!(send_mesg, Float64(rank))
rreq = MPI.Irecv!(recv_mesg, src, src+32, comm)
print("$rank: Sending $rank -> $dst = $send_mesg\n")
sreq = MPI.Isend(send_mesg, dst, rank+32, comm)
# Option 1
# Will the underlying C call to MPI_Waitall block the entire Julia process?
# Will other code have a chance to run between @async and wait before the task finishes?
t = @async begin
stats = MPI.Waitall!([rreq, sreq])
print("$rank: Received $src -> $rank = $recv_mesg\n")
end
# run some code here before the communication is done
wait(t)
# Option 2
# is the while loop efficient? will it provide room for other tasks to run?
t = @async begin
done = false
while !done
done, stats = MPI.Testall!([rreq, sreq])
end
print("$rank: Received $src -> $rank = $recv_mesg\n")
end
# run some code here before the communication is done
wait(t)
# Option 3
# I would say that in this case other code can run between @task and schedule,
# but I would like to avoid explicitly calling schedule
t = @task begin
stats = MPI.Waitall!([rreq, sreq])
print("$rank: Received $src -> $rank = $recv_mesg\n")
end
# run some code here before the communication is done
wait(schedule(t))
It would be good to have a way to integrate MPI_Wait with the libuv event loop used by Julia tasks (green threads), so that Julia tasks can execute while MPI is waiting and so that the waiting task can wake up as soon as the request is available.
Using MPI.Waitall (options 1 or 3) won’t work, because that call does not return until a request completes. Using a spin loop (option 2) should work as long as you call yield() in the loop, but it is pretty inefficient: it consumes a lot of cycles testing over and over again.
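For reference, a minimal sketch of option 2 with an explicit yield() added (reusing the variables from the code in the question); without the yield() the @async task would monopolize the scheduler once it starts running, since the loop otherwise contains no yield point:
t = @async begin
    done = false
    while !done
        done, stats = MPI.Testall!([rreq, sreq])
        yield()  # give other Julia tasks a chance to run between tests
    end
    print("$rank: Received $src -> $rank = $recv_mesg\n")
end
# run some code here before the communication is done
wait(t)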
The ideal thing would be if an MPI_Request could be converted to a file descriptor, since libuv can wait efficiently on file descriptors via poll, but I don’t see a standard way to do this (even though some MPI implementations may use file descriptors internally for asynchronous requests).
However, MPI_Wait is thread-safe, along with MPI_Waitall, so that offers another option. You could spawn a (real) thread to wait on MPI, and when the wait succeeds you could use uv_async_send to notify the main libuv (Julia) event loop. See also here and here.
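A hedged sketch of that idea, using Base.AsyncCondition (which wraps a libuv uv_async_t handle) with a Julia thread standing in for the waiting thread; it assumes a Julia session started with multiple threads and a thread-safe MPI initialization (see MPI.Init_thread below), and is only meant to illustrate the notification mechanism:
cond = Base.AsyncCondition()          # wraps a libuv uv_async_t
Threads.@spawn begin
    MPI.Waitall!([rreq, sreq])        # blocks this thread only
    # wake up the libuv event loop; safe to call from another thread
    ccall(:uv_async_send, Cint, (Ptr{Cvoid},), cond.handle)
end
# run some code here before the communication is done
wait(cond)                            # resumes as soon as uv_async_send fires
print("$rank: Received $src -> $rank = $recv_mesg\n")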
It would be nice if something like this were implemented in MPI.jl, since it’s rather low-level stuff that most users wouldn’t want to muck with directly, though now that Julia supports real threading it should be easier.
MPI also has something called a generalized request that allows you to define custom mechanisms for asynchronous operations. I haven’t read the documentation closely yet, but this might ultimately be the best way to integrate MPI requests with Julia.
Thanks for your answers @stevengj!
And what about using @threadcall? (Multi-Threading · The Julia Language)
@threadcall on MPI_Wait or MPI_Waitall sounds like a good choice here, too; I didn’t actually know about that macro! Since it only works with ccall, you’ll have to re-implement MPI.Waitall!, or submit a patch to MPI.jl; it might be reasonable to have a keyword argument to do this.
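A very rough sketch of what such a re-implementation could look like; the names MPI.libmpi, MPI.MPI_Request, MPI.Status, and req.val are assumptions about MPI.jl internals (they vary across MPI.jl versions and MPI implementations), and the helper threaded_waitall is hypothetical:
# Hypothetical helper: run the blocking MPI_Waitall on a libuv threadpool
# thread via @threadcall, so the calling Julia task yields until it returns.
function threaded_waitall(reqs::Vector{MPI.Request})
    raw   = MPI.MPI_Request[r.val for r in reqs]        # assumed raw handles
    stats = Array{MPI.Status}(undef, length(reqs))
    # C signature: int MPI_Waitall(int count, MPI_Request reqs[], MPI_Status stats[])
    @threadcall((:MPI_Waitall, MPI.libmpi), Cint,
                (Cint, Ptr{MPI.MPI_Request}, Ptr{MPI.Status}),
                length(reqs), raw, stats)
    return stats
end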
I’ve added a PR for a @threadcall-based wait here: RFC: Define wait(req) to use threadcall by simonbyrne · Pull Request #452 · JuliaParallel/MPI.jl · GitHub. Thoughts/comments appreciated.
If you want to use a Julia thread, you can simply use
t = Threads.@spawn MPI.Waitall!([rreq, sreq])
(though you will need to start Julia with multiple threads, and use MPI.Init_thread)
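Putting that together, a minimal sketch; the thread level THREAD_MULTIPLE is an assumption here, a lower level may suffice depending on which threads post the requests:
# Start Julia with e.g. `julia -t 2` and initialise MPI with thread support
# instead of plain MPI.Init():
MPI.Init_thread(MPI.THREAD_MULTIPLE)
# ... post Irecv!/Isend as in the original code ...
t = Threads.@spawn MPI.Waitall!([rreq, sreq])   # blocks one thread, not the scheduler
# run some code here before the communication is done
wait(t)
print("$rank: Received $src -> $rank = $recv_mesg\n")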
In Chmy.jl, we recently combined a task-based approach with MPI and asynchronous GPU operation (relying on TLS), with the aim of hiding MPI communication behind stencil computation. The approach successfully allowed scaling a 3D thermo-mechanical Stokes flow to the entire LUMI supercomputer. It combines long-running tasks (workers initialised by a launcher) on which task-local exchangers and a stack allocator provide the work to achieve the halo exchange where cooperative waiting is needed.