Question about CUDA-aware MPI

I’d like to make sure I understand how Julia hooks into CUDA-aware MPI. Is it true that when sending/receiving Julia objects that may have a CuArray somewhere inside them, the CuArray is never moved to the CPU but is instead passed directly from GPU to GPU?

Here’s an example script:

using MPI
MPI.Init()

using CuArrays
using CUDAdrv
using CUDAnative

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
device!(rank)   # assign one GPU per MPI rank

@info "MPI process $rank is using $(device())"

if rank == 0
    # send a NamedTuple containing a CuArray
    dat = (x = cu(ones(4, 4)),)
    MPI.send(dat, 1, 0, comm)
else
    # MPI.recv returns (object, status)
    dat, = MPI.recv(0, 0, comm)
    @show dat
end

When I mpiexec -n 2 this script on a machine with 2 GPUs, does the CuArray inside dat ever have to go through the CPU? Is there a way to check? Thanks.

There is no easy way to check :), but the fact that your code runs without crashing is a good indication.
I am not sure we handle the case where a CuArray is wrapped in a tuple right now, though.
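
If it helps, a more direct way to exercise the CUDA-aware path is to hand the CuArray itself to the buffer-based MPI.Send / MPI.Recv!, rather than serializing a wrapping tuple with send/recv. A minimal sketch, assuming your MPI.jl version accepts device buffers and the underlying MPI library is CUDA-aware:

# Sketch only: pass the CuArray buffer directly to MPI, assuming the
# installed MPI library is CUDA-aware and MPI.jl accepts device buffers here.
using MPI, CuArrays
MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

if rank == 0
    x = cu(ones(Float32, 4, 4))
    MPI.Send(x, 1, 0, comm)           # device pointer handed straight to MPI
else
    x = CuArrays.zeros(Float32, 4, 4)
    MPI.Recv!(x, 0, 0, comm)          # received directly into GPU memory
    @show Array(x)
end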

The current version of MPI.jl has an MPI.has_cuda() function with which you can check whether CUDA-aware MPI is available.
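
For example, something along these lines:

using MPI
MPI.Init()

# Reports whether the underlying MPI library advertises CUDA support;
# if this is false, device buffers typically get staged through host memory.
@info "CUDA-aware MPI: $(MPI.has_cuda())"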

A good indication is to run the NVIDIA system profiler with its MPI integration and check whether you are seeing unnecessary copies to the host.
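
For instance (an illustrative command line, assuming OpenMPI and nvprof; exact flags depend on your setup, and script.jl stands in for the test script above):

mpiexec -n 2 nvprof -o profile.rank%q{OMPI_COMM_WORLD_RANK}.nvprof julia script.jl

You can then open the per-rank output files in the visual profiler and look for [CUDA memcpy HtoD] / [CUDA memcpy DtoH] entries around the MPI calls.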