CUDA-aware MPI works on the system but not for Julia

Disabling the memory pool did the trick, thanks! I had already tested all the other suggestions, thanks to the great help of @samo. I do not know what conclusion to draw from this, though. Does that mean that the memory pool and CUDA-aware MPI are incompatible? Or is this something that needs fixing in our MPI installation, in MPI.jl, or in CUDA.jl?


Glad this did the trick!

Does that mean that the memory pool and CUDA-aware MPI are incompatible together?

Yes, it seems that the memory management done by CUDA.jl “conflicts” with what CUDA-aware MPI expects (see https://juliagpu.gitlab.io/CUDA.jl/usage/memory/#Memory-pool and https://juliagpu.gitlab.io/CUDA.jl/usage/memory/#Environment-variables). Setting the memory pool to none “directly defers to the CUDA allocator”.
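For concreteness, here is a minimal sketch of what that looks like in practice (assuming MPI.jl is already built against a CUDA-aware system MPI; the environment variable has to be set before CUDA.jl is loaded or initialized):

```julia
# Disable the CUDA.jl memory pool so device buffers come straight from the
# CUDA allocator; this must happen before CUDA.jl is loaded/initialized.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# With the pool disabled, CuArrays can be passed directly to MPI calls,
# and the CUDA-aware MPI registers/transfers the device memory itself.
send = CUDA.fill(Float64(rank), 4)
recv = CUDA.zeros(Float64, 4)
MPI.Allreduce!(send, recv, +, comm)

MPI.Finalize()
```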

Nothing CUDA.jl does here is fancy; rather, CUDA’s own APIs are incompatible with one another: see “Legacy cuIpc* APIs incompatible with stream-ordered allocator” (https://github.com/JuliaGPU/CUDA.jl/issues/1053).

CUDA.jl upgrades your driver library by using a forward-compatible libcuda.so.

Why? What’s inherently incompatible between our CUDA artifacts and MPI? The issue here seems to be the IPC/memory-pool incompatibility, not the actual CUDA binaries.


Alright, that is very useful information. I can work without the memory pool for now, and hopefully this gets resolved in the future. If it would help to provide failing test programs anywhere, let me know, but I conclude from your answer that the problem is already very clear to the developers.

I think there is no reason to believe that they would be inherently incompatible. It is certainly only a matter of the installation.

However, as far as I am aware, it is currently not possible to have CUDA-aware MPI.jl working without using a system-installed CUDA-aware MPI, and for a CUDA-aware installation we need to specify which CUDA installation to use. So it seems natural to also use this very same system-installed CUDA for CUDA.jl, in order to be sure that everything works together smoothly. Now, I could imagine that it is possible to either 1) first install CUDA.jl with artifacts and install the CUDA-aware system MPI against it, or 2) just use the CUDA libraries from the CUDA.jl artifacts at runtime. I think 1) could be of interest in order to avoid having to install CUDA manually; I am not sure how many benefits or problems 2) would bring. Do you have any comments on 1) and 2)?
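(To be concrete, relying on the system-installed MPI currently looks roughly like this on our side. This is only a sketch; newer MPI.jl versions handle the selection through MPIPreferences instead of the build-time environment variable:)

```julia
# Point MPI.jl at the system-installed, CUDA-aware MPI instead of the JLL.
# (Newer MPI.jl versions use MPIPreferences.use_system_binary() instead.)
ENV["JULIA_MPI_BINARY"] = "system"

using Pkg
Pkg.build("MPI")

using MPI
MPI.Init()
@show MPI.has_cuda()   # ask the MPI library whether it reports CUDA awareness
```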

In any case, I think the most interesting thing would be if, at some point, MPI.jl could support CUDA-aware MPI without having to rely on a system-installed MPI. This would be nice for small clusters or multi-GPU workstations, at least as a quick start or a fallback (on supercomputers, a system-optimized MPI, such as Cray MPICH in our case, will certainly always be preferred). @simonbyrne, could you maybe comment on the feasibility of this and whether it is anywhere on the horizon? :slight_smile:


Ah ok, so the MPI back-end JLLs selected by MPI.jl (assuming JLLs are used, and it’s not just the system version again) need to be built against the same CUDA version used by CUDA.jl. That’s a work in progress, and not possible yet (it needs a CUDA_jll.jl that can be used by both CUDA.jl and those MPI back-ends).

The other advantage is that, in general, CUDA.jl does a better job of selecting a CUDA toolkit that’s supported by your system, both in terms of compatibility and in picking the most up-to-date version, which matters since there are known compilation bugs with all but the most recent CUDA compiler.
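(A quick way to see which toolkit and driver CUDA.jl ended up selecting on a given system:)

```julia
using CUDA

# Prints the selected CUDA toolkit, the driver version, and the available
# devices, which makes it easy to check what CUDA.jl picked.
CUDA.versioninfo()
```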

Possibly? @vchuravy did get UCX_jll to build on a small number of platforms with CUDA support, but it’s not clear if this is feasible more generally.

Thanks @maleadt and @simonbyrne. It is good to have a review of the situation from time to time :slight_smile:

Does this work have any connection to https://github.com/JuliaPackaging/Yggdrasil/issues/2063? Asking because AFAIK MPI.jl provides the only GPU-compatible broadcast + allreduce interface in Julia land, but deep learning framework users are unlikely to have a compatible system MPI installation.

No, it doesn’t. I made a Julia port https://github.com/Chiil/MicroHH.jl of our C++/CUDA atmospheric simulator https://github.com/microhh/microhh and was trying to see how hard it is to get CUDA-aware MPI running.


Yes, we should be able to handle that generally, once some of the artifact work has propagated a bit.