Multi-GPU single-host training example

Hello - I am completely new to Julia and just started playing around with it in the last couple of days over my winter break. It's a great language :slight_smile: Thank you for building/maintaining it!

Can somebody point me to an end-to-end example of training a deep learning model using multiple GPUs? Specifically, I am looking for an example where the training happens on a single host using GPU-to-GPU comms (e.g., with NCCL) instead of going via the CPU. Essentially, I am looking for an equivalent of PyTorch’s DDP (DistributedDataParallel).

Here is what I have found so far on this topic:

  • Flux tutorial with GPU - this covers the single-GPU case.
  • CUDA.jl multiple GPU support - this explicitly mentions that inter-GPU comms go via the CPU (see the sketch after this list).
  • MPI + CUDA - there is not much information here, other than that I would have to rebuild MPI with CUDA support.
  • Dagger.jl - this seems to be a Ray-like DAG execution engine.
  • DaggerGPU - again not much info; perhaps this would start to make sense after a deep dive into Dagger.
  • DaggerFlux - this seems the most promising, but the example is again for a single GPU.
  • Julia Computing webinar - sign-up page for a webinar with a promising title that took place earlier this year, but I could not find a recording of it.
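For concreteness, here is the multi-GPU pattern I understand CUDA.jl supports out of the box (a minimal sketch based on my reading of the docs, not a training loop): one Julia task per device, with any cross-GPU exchange staged through host memory, which is exactly the CPU round trip I would like to avoid.

```julia
using CUDA

# Toy per-device workload: square a chunk of data on each GPU.
chunks  = [rand(Float32, 1024) for _ in devices()]
results = Vector{Vector{Float32}}(undef, length(chunks))

@sync for (i, dev) in enumerate(devices())
    @async begin
        device!(dev)                 # bind this task to one GPU
        d_x = CuArray(chunks[i])     # host -> device copy
        d_y = d_x .^ 2               # compute on this device
        results[i] = Array(d_y)      # device -> host copy (the CPU is the middleman)
    end
end

# Any cross-GPU combination then happens on the CPU:
total = reduce(+, results)
```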

Any pointers are much appreciated!

Regards,

Maybe you will find the information in these two threads/repos useful: How to use multiple GPUs on a single node · Issue #69 · jonathan-laurent/AlphaZero.jl · GitHub and GitHub - fabricerosay/AlphaGPU: Alphazero on GPU thanks to CUDA.jl (especially the first link). However, I have to admit that I am not sure whether they leverage GPU-to-GPU comms that bypass the CPU entirely.

GitHub - avik-pal/FluxMPI.jl: MultiGPU / MultiNode Training of Flux Models is the closest to what you’re looking for, but it does require a CUDA-aware MPI. I’m not aware of an easy way to obtain that without getting into HPC tooling, but it is worth a try.
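To make that suggestion concrete, the underlying pattern is roughly the following hand-rolled sketch, written with plain MPI.jl + CUDA.jl + Flux (the implicit-params Flux API) rather than FluxMPI’s actual API, and assuming a CUDA-aware MPI so the Allreduce can operate on device buffers directly:

```julia
# DDP-style data parallelism by hand: every rank holds a full replica of the
# model, computes gradients on its own data shard, and the gradients are
# averaged with an MPI Allreduce before each parameter update.
using MPI, CUDA, Flux

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

device!(rank % length(devices()))      # one GPU per rank on a single host

model = Chain(Dense(32, 64, relu), Dense(64, 1)) |> gpu
ps    = Flux.params(model)
opt   = ADAM()

# Toy per-rank shard; in practice each rank loads its own slice of the dataset.
x = CUDA.rand(Float32, 32, 16)
y = CUDA.rand(Float32, 1, 16)

for step in 1:100
    gs = gradient(() -> Flux.Losses.mse(model(x), y), ps)
    for p in ps
        g = gs[p]
        g === nothing && continue
        # With a CUDA-aware MPI this Allreduce runs directly on the CuArray;
        # without it, you would have to stage through host Arrays.
        MPI.Allreduce!(g, +, comm)
        g ./= nranks
    end
    Flux.Optimise.update!(opt, ps, gs)
end
```

Each rank keeps a full model replica and only the gradients cross the wire, which is essentially what DDP does.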

Longer term, DaggerFlux should get DDP-like sync through Optimize CuArrayDeviceProc with IPC and DtoD by jpsamaroo · Pull Request #17 · JuliaGPU/DaggerGPU.jl · GitHub. My understanding is that this still needs to be wired up first, however.

One challenge for this kind of training is that we don’t have a maintained wrapper or equivalent Julia-level API for NCCL’s collective communication operations. If GitHub - JuliaGPU/NCCL.jl: A Julia wrapper for the NVIDIA Collective Communications Library. were resuscitated, a version of FluxMPI built on that instead of MPI would be cool to have.


It looks like you are well informed on these topics. Thanks for the additional info; it's very useful.

Thanks for all the pointers. It seems there is still work to be done before Julia has a “batteries-included” framework for efficient multi-GPU training. I will keep plugging away at it and continue to watch this space with great interest!


Just to be clear, FluxMPI should work right now. Setting up MPI isn’t something most deep learning practitioners are used to, but it shouldn’t be too bad (some PyTorch workflows require it, for example).
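If you want to try it, the setup I’d expect looks roughly like this (a sketch; the details depend on your MPI installation and cluster):

```julia
# One-time setup: point MPI.jl at a CUDA-aware system MPI,
# e.g. an OpenMPI built with CUDA support.
using MPIPreferences
MPIPreferences.use_system_binary()   # then restart Julia so MPI.jl picks it up

# In a fresh session, check whether the selected MPI is CUDA-aware:
using MPI
MPI.Init()
@info "CUDA-aware MPI?" MPI.has_cuda()

# Launch one rank per GPU, e.g. from a shell:
#   mpiexec -n 4 julia --project train.jl
```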

Longer-term, it would be really nice to get some help bringing NCCL.jl and GitHub - JuliaParallel/UCX.jl up to speed. Take this as a big, flashing “help wanted” sign.
