Using views of CuArray with CUDA-aware MPI is extremely slow

Hi everyone,

I attempted a simple 2D boundary (halo) communication using MPI.jl and CUDA.jl, based on slight modifications of the example in the MPI.jl documentation. However, when I used views, the wall time was about 100 times slower than without views.

On the other hand, using views of plain (CPU) Julia arrays did not cause any performance issues.

I have a few questions:

  • Can I use views with CUDA-aware MPI?
  • Are there any better communication methods available?
  • (Additionally, why am I able to communicate with CuArray even when MPI.has_cuda() returns false?)

Thank you!

using MPI
using CUDA
using Statistics

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank + 1, size)
src = mod(rank - 1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 1024
G = 4
A = zeros(N, N)               # also measured with A = CUDA.zeros(N, N) for the GPU case
send_mesg = @views A[1+G:2G, :]
recv_mesg = @views A[1:G, :]
fill!(send_mesg, Float64(rank))
CUDA.synchronize()

nitr = 10
elapsed = zeros(Float64, nitr)
for i in 1:nitr
    start = MPI.Wtime()

    MPI.Sendrecv!(send_mesg, recv_mesg, comm; dest=dst, source=src)

    finish = MPI.Wtime()

    elapsed[i] = finish - start
    rank == 0 && @show elapsed[i]
end

println("Average communication time on proc $rank over $nitr iterations: $(mean(elasped)) seconds")
rank == 0 && println("done.")
MPI.Finalize()

I measured the time with and without @views, and with both CUDA.zeros and zeros. Please disregard any remaining mistakes in the code (it may not even communicate correctly as written).


I noticed that the memory layout is the problem.

send_mesg = @views A[:, 1+G:2G]
recv_mesg = @views A[:, 1:G]

This did not cause any performance issues.
The performance issue occurs because the former layout (slicing rows) is non-contiguous, whereas this latter layout (slicing columns) is contiguous, since Julia arrays are column-major.
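
For example, a quick CPU-side check using the unexported Base.iscontiguous helper (the same column-major layout reasoning applies to views of a CuArray):

A = zeros(8, 8); G = 2
row_view = @view A[1+G:2G, :]   # block of rows: strided, with gaps between columns
col_view = @view A[:, 1+G:2G]   # block of columns: one contiguous chunk of memory
Base.iscontiguous(row_view)     # false
Base.iscontiguous(col_view)     # true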

But I still wonder:

  • Why doesn't a non-contiguous CPU SubArray cause a performance problem?
  • What should I do when I want to communicate non-contiguous data?

A non-contiguous layout (as views can return) is extremely slow with GPU-aware MPI; from personal experience, looking at profiler results, it seems that GPU-aware MPI generates a message for each element of a non-contiguous array.

A work-around is to define contiguous send and receive buffers (which can live in GPU memory) and use those for sending/receiving.
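
For example, a minimal sketch of that work-around on top of the MWE above, assuming A is a CuArray (CUDA.zeros) and reusing G, N, dst, src, and comm from it:

send_buf = similar(A, G, N)     # contiguous staging buffers, allocated in GPU memory
recv_buf = similar(A, G, N)

send_buf .= @view A[1+G:2G, :]  # pack: strided boundary rows -> contiguous buffer
CUDA.synchronize()
MPI.Sendrecv!(send_buf, recv_buf, comm; dest=dst, source=src)
A[1:G, :] .= recv_buf           # unpack: contiguous buffer -> boundary rows
CUDA.synchronize()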

MPI.has_cuda() only works for Open MPI. Other MPI installations may return false even though GPU-aware MPI is enabled.


Thank you for replying.

I tried the proposed work-around:

  1. Copy the array (view) into a send buffer.
  2. Communicate.
  3. Copy the receive buffer back into the array.

But I found this approach is only about 2× faster than the version without buffers.

I found a paper suggesting that the MPI pack/unpack API is useful for non-contiguous data communication.
But it seems that MPI.jl does not export MPI.Pack or MPI.Unpack.
Could the MPI.jl developers implement this API and optimize GPU data communication?
I think 3D decomposition is necessary for high-performance GPU computation.

I use CUDA-aware Open MPI through MPIPreferences, so MPI.has_cuda() should return true.

Something must still not be working as expected, given the poor performance you are reporting. I guess you are using a system-provided MPI to get the GPU-aware functionality. Can you confirm you are selecting it appropriately, given that MPI.has_cuda() returns false when it should be true?

Maybe the above MWE is not the best one to assess performance, as one may want to use non-blocking communication as well.

Regarding MPI.Pack and MPI.Unpack, feel free to open an issue on MPI.jl to ask for this feature.


Thank you. I tried to find the reason MPI.has_cuda() does not work, but I could not. I apologize for the delayed response; I've been trying various things.
I will open an issue about MPI.Pack and MPI.Unpack.

Perhaps this can be useful: GitHub - omlins/julia-gpu-course: GPU Programming with Julia - course at the Swiss National Supercomputing Centre (CSCS), ETH Zurich


No worries. Well, I guess one should make sure that CUDA-aware MPI is really functional, and maybe use a better MWE to assess point-to-point communication performance.
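
For instance, a minimal sketch of such a functionality check (assuming two ranks), which just passes GPU buffers straight to MPI; if the build is not CUDA-aware this typically errors or crashes:

using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
if MPI.Comm_rank(comm) == 0
    MPI.Send(CUDA.ones(Float64, 8), comm; dest=1, tag=0)   # send directly from GPU memory
elseif MPI.Comm_rank(comm) == 1
    buf = CUDA.zeros(Float64, 8)
    MPI.Recv!(buf, comm; source=0, tag=0)                  # receive directly into GPU memory
    @show Array(buf)                                       # expect eight 1.0 values
end
MPI.Finalize()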

Can you report what MPI.versioninfo() and CUDA.versioninfo() are showing?

Thank you. But at a glance, it seems 2D decomposition is not covered.

Somehow MPI.has_cuda() works locally, but I do not remember what I did.
The only thing I remember is that I set this VS Code setting:

{
  "terminal.integrated.env.linux": {
    "JULIA_PKG_USE_CLI_GIT": "true",
    "UCX_WARN_UNUSED_ENV_VARS": "n",
    "UCX_ERROR_SIGNALS": "SIGILL,SIGBUS,SIGFPE"
  }
}

Just to be sure, I will share the versioninfo outputs.

CUDA runtime 12.5, artifact installation
CUDA driver 12.5
NVIDIA driver 555.58.2

CUDA libraries: 
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+555.58.2

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4070 Ti (sm_89, 10.789 GiB / 11.994 GiB available)
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /usr/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.4, package: Open MPI builduser@buildhost Distribution, ident: 5.0.4, repo rev: v5.0.4, Jul 18, 2024
MPI.has_cuda() = true

But on the HPC system, MPI.has_cuda() still returns false.

MPI.has_cuda() = false
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/t4/rhel9/free/openmpi/5.0.2/nvhpc/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.2, package: Open MPI root@login1 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024

CUDA runtime 12.5, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA H100 (sm_90, 93.004 GiB / 93.584 GiB available)

Additionally, I tried an MWE using Irecv! and Isend, but the problem still occurs.

using MPI
using CUDA
using Statistics

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dest = mod(rank + 1, size)
source = mod(rank - 1, size)
println("rank=$rank, size=$size, dest=$dest, source=$source")
N = 1024
G = 4
A = CUDA.zeros(N, N)          # note: CUDA.zeros defaults to Float32
send_mesg = @views A[1+G:2G, :]
recv_mesg = @views A[1:G, :]
fill!(send_mesg, Float64(rank))
CUDA.synchronize()

nitr = 10
elapsed = zeros(Float64, nitr)
for i in 1:nitr
    start = MPI.Wtime()

    MPI.Waitall([
        MPI.Isend(send_mesg, comm; dest),
        MPI.Irecv!(recv_mesg, comm; source)
    ])

    finish = MPI.Wtime()

    elapsed[i] = finish - start
    rank == 0 && @show elapsed[i]
end

println("Average communication time on proc $rank over $nitr iterations: $(mean(elasped[2:end])) seconds")
rank == 0 && println("done.")
MPI.Finalize()

On your HPC system, it seems you are using the CUDA artifact instead of the system libraries, which may explain why CUDA-aware MPI is not working. Make sure to select the system CUDA libraries: Overview · CUDA.jl
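
For reference, one way to do that from the REPL (assuming CUDA.jl 5.x; this writes the preference to LocalPreferences.toml and takes effect after restarting Julia):

using CUDA
CUDA.set_runtime_version!(v"12.3"; local_toolkit=true)   # use the system CUDA 12.3 toolkit instead of the artifact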


Thank you.
I modified LocalPreferences.toml to use the local NVIDIA toolkit, and I got:

MPI.has_cuda() = false
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/t4/rhel9/free/openmpi/5.0.2/nvhpc/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.2, package: Open MPI root@login1 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024
CUDA runtime 12.3, local installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 2023.3.1 (API 21.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0
- CUDA_Runtime_Discovery: 0.3.4

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.3
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA H100 (sm_90, 93.004 GiB / 93.584 GiB available)

I suspected an environment-variable issue, so I tried:

UCX_WARN_UNUSED_ENV_VARS=n UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE" mpiexec -n 4 julia --project tmp.jl

But MPI.has_cuda() still returns false.


@luraess I followed your method and achieved a speedup from 60 s to 0.1 s in my simulation code. Thanks for sticking with me through this long discussion.
I'll see what I can do about the other problem. I'm truly thankful for your help!
