I have problems getting CUDA-aware MPI running in Julia. I have a C++ example that works perfectly when I run it with mpiexec -n 2 ./alltoall_test. The equivalent Julia code, run with mpiexec -n 2 julia alltoall_test.jl, fails with a segmentation fault and the UCX error [1642930332.032032] [gcn19:4087661:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22.
I have set up MPI.jl to use the system MPI, and it reports the correct version, so my impression is that the problem lies on the CUDA side: CUDA.jl wants to download artefacts as soon as I use a CuArray, despite CUDA being installed on the system. My suspicion is that I am using a CUDA version that is incompatible with the one MPI was compiled against, but I do not know how to verify this. I am using the JULIA_CUDA_USE_BINARY_BUILDER=false setting.
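If it helps, a minimal check along these lines (assuming MPI.Get_library_version, MPI.has_cuda and CUDA.versioninfo are the relevant calls for this) should show which libraries Julia actually loads on each rank:
using MPI
using CUDA
MPI.Init()
# Version string of the libmpi that MPI.jl has actually loaded.
println(MPI.Get_library_version())
# Whether MPI.jl reports CUDA support for that library.
println("has_cuda: ", MPI.has_cuda())
# Which CUDA toolkit and driver CUDA.jl resolves to (artifact vs. local installation).
CUDA.versioninfo()
MPI.Finalize()
If the system MPI is really picked up, the first line should print the version string of that build rather than of an MPI binary Julia downloaded itself.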
The two test codes are:
using MPI
using CUDA
np = 2
MPI.Init()
comm = MPI.COMM_WORLD
mpiid = MPI.Comm_rank(comm)
print("The MPI rank is: $mpiid\n")
device!(mpiid)
print("The CUDA device is: $(device())\n")
n = 1024
data_cpu = rand(n)
data_out_cpu = similar(data_cpu)
data = CuArray(data_cpu)
data_out = similar(data)
# Test the alltoall on the CPU
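# With 1024 elements and 2 ranks, each 512-element block goes to one rank.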
mpi_data_cpu = MPI.UBuffer(data_cpu, 512)
mpi_data_out_cpu = MPI.UBuffer(data_out_cpu, 512)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
# Test the alltoall on the GPU
print("$mpiid has CUDA: $(MPI.has_cuda())\n")
mpi_data = MPI.UBuffer(data, 512)
mpi_data_out = MPI.UBuffer(data_out, 512)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
# Close the MPI.
MPI.Finalize()
and
#include <iostream>
#include <vector>
#include <mpi.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <chrono>
int main()
{
    MPI_Init(NULL, NULL);
    int n, id;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    const size_t size_tot = 1024*1024*1024;
    const size_t size_max = size_tot / n;
    // CPU TEST
    std::vector<double> a_cpu_in (size_tot);
    std::vector<double> a_cpu_out(size_tot);
    std::fill(a_cpu_in.begin(), a_cpu_in.end(), id);
    std::cout << id << ": Starting CPU all-to-all\n";
    auto time_start = std::chrono::high_resolution_clock::now();
    MPI_Alltoall(
        a_cpu_in .data(), size_max, MPI_DOUBLE,
        a_cpu_out.data(), size_max, MPI_DOUBLE,
        MPI_COMM_WORLD);
    auto time_end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ": Finished CPU all-to-all in " << std::to_string(duration) << " (ms)\n";
    // GPU TEST
    int id_local = id % 4;
    cudaSetDevice(id_local);
    double* a_gpu_in;
    double* a_gpu_out;
    cudaMalloc((void **)&a_gpu_in , size_tot * sizeof(double));
    cudaMalloc((void **)&a_gpu_out, size_tot * sizeof(double));
    cudaMemcpy(a_gpu_in, a_cpu_in.data(), size_tot*sizeof(double), cudaMemcpyHostToDevice);
    int id_gpu;
    cudaGetDevice(&id_gpu);
    std::cout << id << ", " << id_local << ", " << id_gpu << ": Starting GPU all-to-all\n";
    time_start = std::chrono::high_resolution_clock::now();
    MPI_Alltoall(
        a_gpu_in , size_max, MPI_DOUBLE,
        a_gpu_out, size_max, MPI_DOUBLE,
        MPI_COMM_WORLD);
    time_end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ", " << id_local << ", " << id_gpu << ": Finished GPU all-to-all in " << std::to_string(duration) << " (ms)\n";
    MPI_Finalize();
    return 0;
}
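For completeness, the C++ test is built with the MPI compiler wrapper and linked against the CUDA runtime, roughly mpicxx -O2 alltoall_test.cpp -o alltoall_test -lcudart (the exact wrapper name and CUDA include/library paths depend on the local installation, so take this as an approximation), and launched with mpiexec -n 2 ./alltoall_test as mentioned above.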