Using views of CuArray with CUDA-aware MPI is extremely slow

0samuraiE · July 22, 2024, 3:02pm

Hi everyone,

I attempted a simple 2D boundary communication using MPI.jl and CUDA.jl. However, when I used views, the wall time was 100 times slower compared to not using views. I made slight modifications based on the example provided in the MPI documentation.

On the other hand, using views of arrays in Julia did not cause any performance issues.

I have a few questions:

Can I use views with CUDA-aware MPI?
Are there any better communication methods available?
(Additionally, why am I able to communicate with CuArray even when MPI.has_cuda() returns false?)

Thank you!

using MPI
using CUDA
using Statistics

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank + 1, size)
src = mod(rank - 1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 1024
G = 4
A = zeros(N, N)
send_mesg = @views A[1+G:2G, :]
recv_mesg = @views A[1:G, :]
fill!(send_mesg, Float64(rank))
CUDA.synchronize()

nitr = 10
elasped = zeros(Float64, nitr)
for i in 1:nitr
    start = MPI.Wtime()

    MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)

    finish = MPI.Wtime()

    elasped[i] = finish - start
    rank === 0 && @show elasped[i]
end

println("Average communication time on proc $rank over $nitr iterations: $(mean(elasped)) seconds")
rank == 0 && println("done.")
MPI.Finalize()

I measured the time with @views and without @views, and using CUDA.zeros or zeros. Please disregard any incorrect code (it may not even be able to communicate with this code).

gist.github.com

https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2

alltoall_test_cuda.jl

using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 4

This file has been truncated. show original

0samuraiE · July 23, 2024, 7:21am

I noticed memory layout is problematic.

send_mesg = @views A[:, 1+G:2G]
recv_mesg = @views A[:, 1:G]

This did not cause any performance issues.
The reason performance issues occurs is the former layout is non-contiguous, while the latter layout is contiguous.

But I still wonder

Why does’nt non-contiguous CPU subArray occur performance problem?
What should I do when I want to communicate non-contiguous data?

luraess · July 23, 2024, 8:11pm

Non contiguous layout (as possibly returned by views) when using GPU-aware MPI is extremely slow; from personal experience, looking at profiler results, it seems that GPU-aware MPI generates a message for each entry of non-contiguous array element.

A work-around is to define send and receive buffers (which can be on GPU memory) and use those for sending/receiving.

MPI.has_cuda() only works for OpenMPI. Other MPI installs may return false but GPU-aware MPI is enabled.

0samuraiE · July 23, 2024, 10:07pm

Thank you for replying.

I tried a proposed way

Copy array to send buffer.
Communicate
Copy receive buffer to array

But I found this way is only 2x faster than that without buffer.

I found some paper that MPI pack/unpack API is useful for no contiguous data commnunication.
But It seems that MPI.jl does not export MPI.Pack or MPI.unpack.
Could MPI.jl developers implement this API and optimize GPU data commnunication?
I think 3D decomposition is necessary for high performance GPU computation.

I use cuda-aware OpenMPI through MPIPreference. So MPI.has_cuda() should be true.

luraess · July 24, 2024, 7:33pm

Something must still not work as expected given the poor performance you are reporting. I guess you are using system provided MPI to have the GPU-aware functionality. Can you confirm you’re selecting it appropriately given MPI.has_cuda() returns false why it should be true.

Maybe the above MWE is not the most optimal one to assess performance, as one may want ti use non-blocking comm as well.

Regarding MPI.Pack and MPI.Unpack, feel free to open an issue on MPI.jl, to ask for this feature.

0samuraiE · July 29, 2024, 4:00pm

Thank you. I tried to find the reason MPI.has_cuda() does not work. But I cannot . I apologize for the delayed response. I’ve been trying various things.
I will issue MPI.Pack and MPI.Unpack.

rveltz · July 29, 2024, 4:09pm

Perhaps this van be useful GitHub - omlins/julia-gpu-course: GPU Programming with Julia - course at the Swiss National Supercomputing Centre (CSCS), ETH Zurich

luraess · July 29, 2024, 6:16pm

No worries. Well, I guess one should make sure that CUDA-aware MPI is really functional and maybe use a better MWE to assess point-to-point comm perf.

Can you report what MPI.versioninfo() and CUDA.versioninfo() are showing?

0samuraiE · July 30, 2024, 8:02am

Thank you. But at a glance, 2D decomposition is not treated maybe.

0samuraiE · July 30, 2024, 8:33am

Somehow MPI.has_cuda() works in local. But I do not remember What I did.
The only thing I remember is that I set vscode setting

{
  "terminal.integrated.env.linux": {
    "JULIA_PKG_USE_CLI_GIT": "true",
    "UCX_WARN_UNUSED_ENV_VARS": "n",
    "UCX_ERROR_SIGNALS": "SIGILL,SIGBUS,SIGFPE"
  }
}

Just to be sure I will share versioninfos.

CUDA runtime 12.5, artifact installation
CUDA driver 12.5
NVIDIA driver 555.58.2

CUDA libraries: 
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+555.58.2

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4070 Ti (sm_89, 10.789 GiB / 11.994 GiB available)
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /usr/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.4, package: Open MPI builduser@buildhost Distribution, ident: 5.0.4, repo rev: v5.0.4, Jul 18, 2024
MPI.has_cuda() = true

0samuraiE · July 30, 2024, 8:39am

But in HPC, MPI.has_cuda() stiil does not work.

MPI.has_cuda() = false
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/t4/rhel9/free/openmpi/5.0.2/nvhpc/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.2, package: Open MPI root@login1 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/t4/rhel9/free/openmpi/5.0.2/nvhpc/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.2, package: Open MPI root@login1 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024
CUDA runtime 12.5, artifact installation
CUDA driver 12.3
CUDA runtime 12.5, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.3.4
NVIDIA driver 545.23.8

CUDA libraries: 
- CURAND: 10.3.6
- CUBLAS: 12.3.4
- CUFFT: 11.2.3
- CURAND: 10.3.6
- CUSOLVER: 11.6.3
- CUFFT: 11.2.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- CUSOLVER: 11.6.3
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
1 device:
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA H100 (sm_90, 93.004 GiB / 93.584 GiB available)

0samuraiE · July 30, 2024, 9:07am

Additionary, I tried MEW using Irecv! and Isend. But still problem occurs.

using MPI
using CUDA
using Statistics

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dest = mod(rank + 1, size)
source = mod(rank - 1, size)
println("rank=$rank, size=$size, dest=$dest, source=$source")
N = 1024
G = 4
A = CUDA.zeros(N, N)
send_mesg = @views A[1+G:2G, :]
recv_mesg = @views A[1:G, :]
fill!(send_mesg, Float64(rank))
CUDA.synchronize()

nitr = 10
elasped = zeros(Float64, nitr)
for i in 1:nitr
    start = MPI.Wtime()

    MPI.Waitall([
        MPI.Isend(send_mesg, comm; dest),
        MPI.Irecv!(recv_mesg, comm; source)
    ])

    finish = MPI.Wtime()

    elasped[i] = finish - start
    rank === 0 && @show elasped[i]
end

println("Average communication time on proc $rank over $nitr iterations: $(mean(elasped[2:end])) seconds")
rank == 0 && println("done.")
MPI.Finalize()

luraess · July 31, 2024, 9:10am

On your HPC system, it seems you are using the artefact for CUDA instead of the system libs which may potentially explain the fact that CUDA-aware MPI is not working. Make sure to select system CUDA libs Overview · CUDA.jl

0samuraiE · August 1, 2024, 1:39pm

Thank you.
I modified LocalPreference to use local nvidia-toolkit.
And I got

MPI.has_cuda() = false
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/t4/rhel9/free/openmpi/5.0.2/nvhpc/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.2, package: Open MPI root@login1 Distribution, ident: 5.0.2, repo rev: v5.0.2, Feb 06, 2024
CUDA runtime 12.3, local installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 2023.3.1 (API 21.0.0)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0
- CUDA_Runtime_Discovery: 0.3.4

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.3
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA H100 (sm_90, 93.004 GiB / 93.584 GiB available)

I suspect ENV so tried

UCX_WARN_UNUSED_ENV_VARS=n UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE" mpiexec -n 4 julia --project tmp.jl

But still MPI.has_cuda() = false.

0samuraiE · August 5, 2024, 12:36pm

@luraess I followed your method and achieved a speedup of 60s → 0.1s in my simulation code. Thanks for sticking with me for this long discussion.
I’ll see what I can do about the other problem. I’m truly thankful for your help!

Topic		Replies	Views
Error/segfault in basic test of CUDA-aware MPI Julia at Scale question	10	1420	November 6, 2020
Question about CUDA-aware MPI GPU	1	861	April 17, 2020
CUDA aware MPI works on system but not for Julia Julia at Scale parallel , mpi	30	3009	January 24, 2022
ANN: MPI.jl v0.10.0: new build process and CUDA-aware support Julia at Scale	26	2735	October 31, 2020
CUDA aware MPI fails but runs on multiple GPUs Julia at Scale	5	779	July 21, 2021

Using views of CuArray with CUDA-aware MPI is extremely slow

Related topics