Hello - I am completely new to Julia and just started playing around with it in the last couple of days over my winter break. It's a great language. Thank you for building/maintaining it!
Can somebody point me to an end-to-end example of training a deep learning model using multiple GPUs? Specifically I am looking for an example where the training happens on a single host leveraging GPU-to-GPU comms (e.g., with NCCL) instead of going via the CPU. Essentially I am looking for an equivalent of PyTorch’s DDP.
https://github.com/avik-pal/FluxMPI.jl is the closest to what you're looking for, but it does require a CUDA-aware MPI build. I'm not aware of an easy way to obtain that without getting into HPC tooling, but it's worth a try.
One challenge for this kind of training is that we don't have a wrapper or equivalent Julia-level API for NCCL's collective communication operations. If https://github.com/JuliaGPU/NCCL.jl were resuscitated, a version of FluxMPI built on that instead of MPI would be cool to have.
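For concreteness, the collective such a wrapper would provide is essentially a gradient all-reduce. Below is a minimal sketch of what that looks like written against MPI.jl directly, not FluxMPI's actual API; the model, loss, and data are placeholder assumptions, and it presumes a CUDA-aware MPI build so CuArrays can be handed to MPI.Allreduce! without staging through the CPU.

using MPI, CUDA, Flux, Functors

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

device!(rank % length(devices()))          # one GPU per MPI rank

# every rank holds a full replica of a (toy) model
model = gpu(Chain(Dense(10, 10, relu), Dense(10, 1)))
loss(m, x, y) = Flux.Losses.mse(m(x), y)

# make sure every replica starts from the same weights
fmap(model) do p
    p isa CuArray && MPI.Bcast!(p, 0, comm)
    p
end

# stand-in for this rank's shard of a real mini-batch
x = CUDA.randn(Float32, 10, 32)
y = CUDA.randn(Float32, 1, 32)

grads = gradient(m -> loss(m, x, y), model)[1]

# average gradients across ranks; with a CUDA-aware MPI the CuArrays are
# passed to MPI directly, so the reduction never goes via host memory
fmap(grads) do g
    g isa CuArray || return g
    MPI.Allreduce!(g, +, comm)
    g ./= nranks
end

# ... apply the averaged `grads` with an optimiser, identically on every rank

MPI.Finalize()

FluxMPI packages up roughly this pattern (gradient synchronisation plus parameter broadcasting) behind a nicer interface.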
Thanks for all the pointers. It seems that there is still work to be done in terms of having a “batteries included” framework for efficient multi-GPU training. Will keep plugging away at it and continue to watch this space with great interest!
Just to be clear, FluxMPI should work right now. Setting up MPI isn't something most deep learning practitioners are used to, but it shouldn't be too bad (some PyTorch workflows require it, for example).
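For anyone trying this, a rough sketch of the workflow (assuming MPI.jl is already in the project, and where train.jl stands in for your training script):

julia -e 'using MPI; MPI.install_mpiexecjl()'
mpiexecjl -n 2 julia --project train.jl    # one process per GPU

For the GPU-to-GPU path to actually be used, the MPI library itself has to be CUDA-aware (e.g. an Open MPI build with CUDA support), and MPI.jl has to be configured to use that system binary rather than its bundled one.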
Longer-term, it would be really nice to get some help bringing NCCL.jl and https://github.com/JuliaParallel/UCX.jl up to speed. Take this as a big, flashing "help wanted" sign.
I'm wondering whether it might be possible to perform multi-GPU, single-host training using only the mechanics CUDA.jl currently offers. I unfortunately don't have a multi-GPU setup to confirm, but given that peer-to-peer communication was made possible in CUDA.jl v3 (see the CUDA.jl 3.5-3.8 release announcement), and looking at the Multiple GPUs section of the CUDA.jl docs, couldn't a data parallel approach be hacked together?
For example, something along these lines:
# copy the model to each device
device!(0)
m0 = gpu(m)
device!(1)
m1 = gpu(m)

for (x, y) in dataloader
    # copy each sub-batch to its device (x0, y0 on device 0; x1, y1 on device 1)
    ...
    local grads0, grads1
    @sync begin
        @async begin
            device!(0)
            copyto!(x0, @view x[:, 1:batch])
            copyto!(y0, @view y[1:batch])
            grads0 = gradient(model -> loss(model, x0, y0), m0)[1]
        end
        @async begin
            device!(1)
            copyto!(x1, @view x[:, batch+1:end])
            copyto!(y1, @view y[batch+1:end])
            grads1 = gradient(model -> loss(model, x1, y1), m1)[1]
        end
    end
    # now the gradients from the other GPU need to be accumulated on a
    # target device before updating the model weights
end
The above is not quite complete, but it seems within the scope of CUDA.jl?
It definitely should be. CUDA.jl also exposes (but doesn't provide high-level wrappers for) the low-level IPC APIs for cross-process communication; DaggerGPU uses those, for example. The remaining effort would be writing custom collective operations like all-reduce for data-parallel training, and the question is whether wrapping the existing ones in a library like NCCL might be easier than doing so.
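For what it's worth, here is a minimal sketch of what the missing accumulation step could look like for the two-GPU, single-process case above, assuming the gradients have been flattened into plain CuArrays g0 (on device 0) and g1 (on device 1); the nested structures Zygote returns would need this mapped over their leaf arrays.

using CUDA

# naive two-GPU "all-reduce": sum the two gradient buffers and leave the
# result on both devices; cross-device copyto! uses peer-to-peer transfers
# when the hardware allows it, otherwise it stages through host memory
function allreduce!(g0::CuArray, g1::CuArray)   # g0 on device 0, g1 on device 1
    device!(0)
    tmp = similar(g0)          # scratch buffer on device 0
    copyto!(tmp, g1)           # pull device 1's gradients over
    g0 .+= tmp                 # accumulate on device 0
    copyto!(g1, g0)            # push the summed gradients back to device 1
    return g0, g1
end

A real implementation would also divide by the number of devices and generalise to N GPUs (e.g. a ring or tree reduction), which is exactly the kind of logic NCCL already provides.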