I attempted a simple 2D boundary communication using MPI.jl and CUDA.jl. However, when I used views, the wall time was about 100 times longer than without views. I made slight modifications to the example provided in the MPI.jl documentation.
On the other hand, using views of plain CPU arrays in Julia did not cause any performance issues.
I have a few questions:
Can I use views with CUDA-aware MPI?
Are there any better communication methods available?
(Additionally, why am I able to communicate with CuArray even when MPI.has_cuda() returns false?)
Thank you!
using MPI
using CUDA
using Statistics
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank + 1, size)
src = mod(rank - 1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 1024
G = 4
A = zeros(N, N)                   # CPU case; I also tested A = CUDA.zeros(Float64, N, N)
send_mesg = @views A[1+G:2G, :]   # row slice: non-contiguous in column-major layout
recv_mesg = @views A[1:G, :]
fill!(send_mesg, Float64(rank))
CUDA.synchronize()
nitr = 10
elapsed = zeros(Float64, nitr)
for i in 1:nitr
start = MPI.Wtime()
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
finish = MPI.Wtime()
elapsed[i] = finish - start
rank == 0 && @show elapsed[i]
end
println("Average communication time on proc $rank over $nitr iterations: $(mean(elasped)) seconds")
rank == 0 && println("done.")
MPI.Finalize()
I measured the time with and without @views, using both CUDA.zeros and zeros. Please disregard any mistakes in the code above (it may not even communicate correctly as written).
The CPU version shown above did not cause any performance issues.
The performance problem arises because the @views version sends a non-contiguous row slice, while the version without views sends a contiguous array.
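For reference, the layout difference can be checked directly (a small CPU sketch; the same distinction applies to CuArray views):
# A row slice keeps only part of each column, so it is strided; a leading column slice is one solid block.
B = zeros(8, 8)
row_view = @view B[1:2, :]
col_view = @view B[:, 1:2]
Base.iscontiguous(row_view)   # false
Base.iscontiguous(col_view)   # true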
But I still wonder:
Why doesn't a non-contiguous CPU SubArray cause a performance problem?
What should I do when I want to communicate non-contiguous data?
A non-contiguous layout (as possibly returned by views) is extremely slow when used with GPU-aware MPI; from personal experience, looking at profiler results, it seems that GPU-aware MPI generates a message for each element of a non-contiguous array.
A work-around is to define contiguous send and receive buffers (which can live in GPU memory) and use those for sending/receiving, as in the sketch below.
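Something along these lines, reusing the variable names from your MWE and assuming A is a CuArray (the buffer shapes are illustrative, not a definitive implementation):
# Contiguous device buffers for the halo rows (allocate once, outside the loop).
send_buf = CUDA.zeros(Float64, G, N)
recv_buf = CUDA.zeros(Float64, G, N)
send_buf .= @view A[1+G:2G, :]   # pack: copy the non-contiguous rows into the buffer
CUDA.synchronize()
MPI.Sendrecv!(send_buf, dst, 0, recv_buf, src, 0, comm)
@views A[1:G, :] .= recv_buf     # unpack: copy the received rows back into A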
Note that MPI.has_cuda() only works for OpenMPI; other MPI installations may return false even though GPU-aware MPI is enabled.
But I found this buffered approach is only about 2x faster than the version without buffers.
I found a paper suggesting that the MPI pack/unpack API is useful for communicating non-contiguous data.
But it seems that MPI.jl does not export MPI.Pack or MPI.Unpack.
Could the MPI.jl developers implement this API and optimize GPU data communication?
I think 3D decomposition is necessary for high-performance GPU computation.
I use CUDA-aware OpenMPI through MPIPreferences, so MPI.has_cuda() should be true.
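For reference, the system binary is typically selected along these lines (a sketch; the exact library name and path depend on the cluster's MPI module):
# Run once in the project environment, then restart Julia.
using MPIPreferences
MPIPreferences.use_system_binary()   # picks up the MPI library found on the system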
Something must still not be working as expected, given the poor performance you are reporting. I guess you are using a system-provided MPI to get the GPU-aware functionality. Can you confirm you are selecting it correctly, given that MPI.has_cuda() returns false when it should be true?
Maybe the above MWE is not the most optimal one to assess performance, as one may want to use non-blocking communication as well; see the sketch below.
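For example, with the contiguous device buffers from above, a non-blocking exchange could look roughly like this (a sketch using the keyword API of recent MPI.jl versions; older versions use positional arguments):
# Post the receive first, then the send, and wait on both requests.
req_recv = MPI.Irecv!(recv_buf, comm; source=src, tag=0)
req_send = MPI.Isend(send_buf, comm; dest=dst, tag=0)
MPI.Waitall([req_recv, req_send])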
Regarding MPI.Pack and MPI.Unpack, feel free to open an issue on MPI.jl to ask for this feature.
Thank you. I tried to find the reason MPI.has_cuda() does not work, but I could not. I apologize for the delayed response; I have been trying various things.
I will open an issue about MPI.Pack and MPI.Unpack.
On your HPC system, it seems you are using the CUDA artifact instead of the system libraries, which may explain why CUDA-aware MPI is not working. Make sure to select the system CUDA libs (see Overview · CUDA.jl).
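If I remember correctly, recent CUDA.jl versions can be pointed at the local toolkit roughly like this (the exact API depends on your CUDA.jl version, so please check the linked docs):
# Assumption: CUDA.jl 5.x-style API; run once, then restart Julia and check CUDA.versioninfo().
using CUDA
CUDA.set_runtime_version!(local_toolkit=true)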
@luraess I followed your method and achieved a speedup of 60s → 0.1s in my simulation code. Thanks for sticking with me for this long discussion.
I’ll see what I can do about the other problem. I’m truly thankful for your help!