CUDA-aware MPI check fails, but code still runs on multiple GPUs

I am running the library ImplicitGlobalGrid.jl on a server to learn about multi-GPU computing. The server has MPI and CUDA, but when I try MPI.has_cuda() I get false, which tells me that the MPI is not CUDA-aware.
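
For reference, a minimal sketch of that check (assuming MPI.jl is already installed in the active project environment):

```julia
using MPI

MPI.Init()
@show MPI.has_cuda()   # returns false here, i.e. this MPI build is not CUDA-aware
```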

However, when I run one of the GPU examples from the repo above and specify that I want it to run on 2 GPUs using salloc, it runs. You can see the results below (with a coarse grid, since I wanted it to run fast).

If a system does not have CUDA-aware MPI, can it still run on multiple GPUs? I was actually hoping to get an error so I could see what was wrong, but no error occurred.

[fpoulin@cdr353 examples]$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_novis.jl 
┌ Warning: The NVIDIA driver on this system only supports up to CUDA 11.1.0.
│ For performance reasons, it is recommended to upgrade to a driver that supports CUDA 11.2 or higher.
└ @ CUDA ~/.julia/packages/CUDA/lwSps/src/initialization.jl:42
┌ Warning: The NVIDIA driver on this system only supports up to CUDA 11.1.0.
│ For performance reasons, it is recommended to upgrade to a driver that supports CUDA 11.2 or higher.
└ @ CUDA ~/.julia/packages/CUDA/lwSps/src/initialization.jl:42
Global grid: 30x16x16 (nprocs: 2, dims: 2x1x1)

I have learned something perhaps obvious: even if a system does not have CUDA-aware MPI, it can still run on multiple GPUs; however, the efficiency will in general not be as good. Sorry for the bother.

CUDA-aware MPI just determines whether or not you can use CuArrays directly as MPI communication buffers: if your MPI is not CUDA-aware, you will have to first copy the contents to an Array and use that as the buffer.
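
To make that concrete, here is a minimal sketch of the two paths (my own illustration, assuming exactly two MPI ranks and the positional MPI.Send/MPI.Recv! signatures; newer MPI.jl releases use keyword-argument forms instead):

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

d_data = CUDA.fill(Float64(rank), 8)    # the data lives on the GPU

if MPI.has_cuda()
    # CUDA-aware MPI: the CuArray itself can be used as the communication buffer.
    if rank == 0
        MPI.Send(d_data, 1, 0, comm)
    else
        MPI.Recv!(d_data, 0, 0, comm)
    end
else
    # Not CUDA-aware: stage the data through a host Array.
    h_data = Array(d_data)              # device -> host copy
    if rank == 0
        MPI.Send(h_data, 1, 0, comm)
    else
        MPI.Recv!(h_data, 0, 0, comm)
    end
    copyto!(d_data, h_data)             # host -> device copy
end

MPI.Finalize()
```

The extra device-host copies in the second branch are exactly where the efficiency loss comes from when the MPI build is not CUDA-aware.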


Hi @francispoulin, thank you for your enthusiastic feedback about ImplicitGlobalGrid.jl. As you figured out, CUDA-aware MPI is an additional feature that allows GPU array pointers to be exchanged directly via MPI (bypassing the explicit copying of buffers to the host prior to exchanging host arrays with MPI). In ImplicitGlobalGrid, non-CUDA-aware MPI is the default implementation, where special care was taken to optimise pipelining for optimal performance. Every MPI process controls one GPU, so you can run multi-GPU applications at scale on supercomputers even without CUDA-aware capabilities.
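
As far as I understand, ImplicitGlobalGrid takes care of this rank-to-GPU mapping itself when the global grid is initialized; just to illustrate the general idea with plain MPI.jl and CUDA.jl (a rough sketch, not ImplicitGlobalGrid's actual selection logic, which would typically use the node-local rank):

```julia
using MPI, CUDA

MPI.Init()
rank  = MPI.Comm_rank(MPI.COMM_WORLD)
ngpus = length(CUDA.devices())

# Bind this MPI rank to one of the visible GPUs (round-robin over device indices).
# A production setup would use the node-local rank rather than the global one.
CUDA.device!(rank % ngpus)
```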

If the MPI build on your cluster supports CUDA-awareness in the future, then exporting IGG_CUDAAWARE_MPI=1 will enable it.
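
For example (a sketch only: setting the variable in the job script with export IGG_CUDAAWARE_MPI=1 before launching Julia is the documented route, and setting it from within Julia before the package is loaded should be equivalent):

```julia
# Assumes a CUDA-aware MPI build is available on the system.
ENV["IGG_CUDAAWARE_MPI"] = "1"   # set before ImplicitGlobalGrid is loaded
using ImplicitGlobalGrid
```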


Note that some of the ImplicitGlobalGrid-related capabilities, implementations, and synergies with ParallelStencil.jl will be discussed in the JuliaCon workshop on Solving differential equations in parallel on GPUs on Friday, July 23, 2021. Tune in if curious 🙂


Thank you @simonbyrne and @luraess for your helpful feedback.

I will most certainly check out the Solving DEs in Parallel on GPUs workshop on Friday!
