Multi-GPU single-host training example

Hello - I am completely new to Julia and just started playing around with it in the last couple of days over my winter break. It's a great language :slight_smile: Thank you for building/maintaining it!

Can somebody point me to an end-to-end example of training a deep learning model using multiple GPUs? Specifically, I am looking for an example where the training happens on a single host, leveraging GPU-to-GPU comms (e.g., with NCCL) instead of going via the CPU. Essentially, I am looking for an equivalent of PyTorch's DDP.

Here is what I have found so far on this topic:

  • Flux tutorial with GPU - this is for a single-GPU use case.
  • CUDA.jl multiple GPU support - this explicitly mentions that the comms go via the CPU.
  • MPI + CUDA - there is not a lot of information here, except that I have to recompile MPI with CUDA flags.
  • Dagger.jl - this seems to be a Ray-like DAG execution engine.
  • DaggerGPU - again not a lot of info, except that perhaps if I do a deep dive into Dagger this might start to make sense.
  • DaggerFlux - this seems most promising, but the example is again for a single GPU.
  • Julia Computing webinar - a sign-up sheet for a webinar with a promising title that happened earlier in the year, but I could not find a recording of it.

Any pointers are much appreciated!

Regards,

2 Likes

Maybe you'll find the information in these two threads/sites useful: How to use multiple GPUs on a single node · Issue #69 · jonathan-laurent/AlphaZero.jl · GitHub and GitHub - fabricerosay/AlphaGPU: Alphazero on GPU thanks to CUDA.jl, especially the first link. However, I have to admit that I am not sure whether they leverage GPU-to-GPU comms, completely bypassing the CPU.

https://github.com/avik-pal/FluxMPI.jl is the closest to what you're looking for, but it does require a CUDA-aware MPI. I'm not aware of an easy way to obtain that without involving HPC stuff, but it's worth a try.
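For a sense of what a CUDA-aware MPI buys you: collectives can then operate directly on device buffers, which is the core of the DDP-style gradient sync. Here is a minimal sketch with plain MPI.jl and CUDA.jl, one process per GPU (the grads buffer is just a placeholder, and this is not FluxMPI's actual API):

using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
device!(rank % length(devices()))   # pin each rank to one GPU on the host

# stand-in for a flattened gradient buffer computed by this rank's model replica
grads = CUDA.rand(Float32, 1024)

# with a CUDA-aware MPI, the allreduce runs directly on the device buffer,
# with no staging through host memory
MPI.Allreduce!(grads, +, comm)
grads ./= MPI.Comm_size(comm)       # average over replicas

Each rank then applies the averaged gradients to its local copy of the model, which is roughly what FluxMPI automates.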

Longer term, DaggerFlux should get DDP-like sync through https://github.com/JuliaGPU/DaggerGPU.jl/pull/17. My understanding is that this still needs to be wired up first, however.

One challenge for this kind of training is that we don't have a wrapper or equivalent Julia-level API for NCCL's collective communication operations. If https://github.com/JuliaGPU/NCCL.jl were resuscitated, a version of FluxMPI built on that instead of MPI would be cool to have.

3 Likes

It looks like you are well informed on these topics. Thanks for the additional info. It's very useful.

Thanks for all the pointers. It seems that there is still work to be done in terms of having a “batteries-included” framework for efficient multi-GPU training. I will keep plugging away at it and continue to watch this space with great interest!

1 Like

Just to be clear, FluxMPI should work right now. Setting up MPI isn't something most deep learning practitioners are used to, but it shouldn't be too bad (some PyTorch models require it, for example).
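For reference, the usual launch pattern is one MPI rank per GPU on the host, e.g. via the mpiexecjl wrapper that MPI.jl can install (the script name here is just a placeholder):

mpiexecjl -n 4 julia --project train.jl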

Longer-term, it would be really nice to get some help bringing NCCL.jl and https://github.com/JuliaParallel/UCX.jl up to speed. Take this as a big, flashing “help wanted” sign.

1 Like

Has there been a solution to this yet?

FluxMPI is still very much a thing if you can get yourself a CUDA-aware MPI. CUDA.jl also ships with NCCL artifacts, so if someone wants to try resuscitating GitHub - JuliaGPU/NCCL.jl: A Julia wrapper for the NVIDIA Collective Communications Library, I'd be happy to provide guidance there.

1 Like

I'm wondering whether it might be possible to perform multi-GPU single-host training with only the mechanics CUDA.jl currently offers. I unfortunately don't have a multi-GPU setup to confirm, but given that peer-to-peer communication has been made possible in CUDA.jl v3 (CUDA.jl 3.5-3.8 ⋅ JuliaGPU), and looking at Multiple GPUs · CUDA.jl, couldn't a data-parallel approach be hacked together?

For example, something along these lines:


using Flux, CUDA

# copy the model onto each device (`gpu` places arrays on the currently active device)
device!(0)
m0 = gpu(m)
device!(1)
m1 = gpu(m)

for (x, y) in dataloader
    # copy each sub-batch to its device (x0, y0 preallocated on device 0; x1, y1 on device 1)
    ...
    @sync begin
        @async begin
            device!(0)
            copyto!(x0, view(x, :, 1:batch))
            copyto!(y0, view(y, 1:batch))
            grads0 = gradient(model -> loss(model, x0, y0), m0)[1]
        end
        @async begin
            device!(1)
            copyto!(x1, view(x, :, batch+1:size(x, 2)))
            copyto!(y1, view(y, batch+1:length(y)))
            grads1 = gradient(model -> loss(model, x1, y1), m1)[1]
        end
    end
    # now the grads from the other GPU need to be accumulated onto a target
    # device before updating the model weights
end

The above is not quite complete, but it seems within the scope of CUDA.jl?
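For the missing accumulation step, one rough option (untested, and only a sketch) would be to copy the gradients from device 1 over to device 0, sum and average them there, update m0, and then copy the new weights back to m1. Shown below for a single pair of gradient arrays g0 (on device 0) and g1 (on device 1); a full version would walk the whole gradient structure the same way, and grads0 / grads1 would need to be declared before the @sync block so they remain visible after the tasks finish:

device!(0)
g1_on_0 = similar(g0)   # staging buffer on device 0
copyto!(g1_on_0, g1)    # device-to-device copy (peer-to-peer if enabled)
g0 .+= g1_on_0          # g0 now holds the summed gradient
g0 ./= 2                # average over the two replicas
# ...then update m0 with the averaged gradients and copy the new weights to m1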

1 Like

It definitely should be. CUDA.jl also exposes, but doesn't wrap, the low-level IPC APIs for cross-process communication; DaggerGPU uses those, for example. The remaining effort would be writing custom collective operations like all-reduce for data-parallel training, and the question is whether wrapping the existing ones in a library like NCCL might be easier than doing so.
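To make that concrete, here is what a very naive hand-rolled all-reduce over one buffer per device might look like (a sketch only; bufs[i] is assumed to be a CuArray living on device i-1, and everything is reduced through device 0 rather than using a proper ring algorithm):

using CUDA

# naive all-reduce: sum every per-device buffer into device 0's copy,
# then broadcast the result back to all devices.
function naive_allreduce!(bufs::Vector{<:CuArray})
    device!(0)
    acc = bufs[1]
    tmp = similar(acc)            # staging buffer on device 0
    for i in 2:length(bufs)
        copyto!(tmp, bufs[i])     # device-to-device copy into the staging buffer
        acc .+= tmp
    end
    for i in 2:length(bufs)
        copyto!(bufs[i], acc)     # broadcast the reduced result back out
    end
    return bufs
end

NCCL does the same job with much smarter ring/tree algorithms over NVLink/PCIe, which is exactly why resuscitating NCCL.jl remains the more attractive long-term option.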