Has anyone ever used multiple GPUs with Flux? If so, did you have to modify your code to run on multiple GPUs?
Yes, I use Flux with multiple GPUs. But my approach is not elegant.
In CUDAnative (commit f60c4754356225151c866da0ca512b434aa03abd), in src/CUDAnative.jl on line 75, I replaced
default_device = CuDevice(0) with
dev = haskey(ENV, "CUDADEV") ? parse(Int, ENV["CUDADEV"]) : 0; default_device = CuDevice(dev), after which I can run (in bash)
export CUDADEV=1 && julia to open a Julia session with CUDAnative using GPU 1.
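The environment-variable lookup above can be sketched as a small standalone helper. This is only an illustration, assuming Julia 1.x: the function name device_index and its keyword arguments are made up here, and only the CUDADEV variable name comes from the post. It does not touch CUDAnative itself; it only shows how the index could be read and validated before being passed to CuDevice.

```julia
# Hypothetical helper (not part of CUDAnative): read a GPU index from an
# environment variable, falling back to a default when it is unset.
# `env` can be ENV or any Dict, which makes the function easy to try out.
function device_index(env = ENV; var = "CUDADEV", default = 0)
    raw = get(env, var, nothing)
    raw === nothing && return default

    # tryparse returns `nothing` instead of throwing on malformed input.
    idx = tryparse(Int, strip(raw))
    if idx === nothing || idx < 0
        error("$var must be a non-negative integer, got \"$raw\"")
    end
    return idx
end

# Variable set to "1": picks GPU 1. Variable unset: falls back to GPU 0.
println(device_index(Dict("CUDADEV" => "1")))
println(device_index(Dict{String,String}()))
```

Passing a Dict instead of ENV keeps the example runnable without changing the real environment; in a session started with export CUDADEV=1 && julia, calling device_index() with no arguments would read the actual variable.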
You’ll be able to do
device!(::CuDevice) on Julia 0.7 (technically, on any recent CUDAdrv/CUDAnative, but those are Julia 0.7 only). Do note that CuArray allocations are currently tied to a device; for better usability we should probably move to unified memory.
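For readers on a current setup, here is a rough sketch of what switching devices can look like. It assumes CUDA.jl (the package that later absorbed CUDAdrv, CUDAnative and CuArrays) rather than the packages named in this thread, and it assumes at least one working GPU. Array sizes and printed messages are illustrative only.

```julia
using CUDA  # assumption: CUDA.jl is installed; it is not the package discussed above

if CUDA.functional()
    # Pick the last visible GPU; device numbering starts at 0.
    dev = CuDevice(length(CUDA.devices()) - 1)

    # Make `dev` the active device for the current task.
    # Arrays allocated from here on belong to that device.
    CUDA.device!(dev)
    x = CUDA.rand(Float32, 1024)
    println("sum on ", CUDA.name(dev), ": ", sum(x))

    # Do-block form: run the body with device 0 active,
    # then switch back to the previously active device.
    CUDA.device!(CuDevice(0)) do
        y = CUDA.ones(Float32, 1024)
        println("sum on device 0: ", sum(y))
    end
else
    println("No functional CUDA setup found; skipping.")
end
```

Because allocations are tied to the device that was active when they were made, keeping both the allocation and the computation inside the same device! scope avoids accidentally mixing arrays from different GPUs.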
Are there any live examples?
Only https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/multigpu.jl, but that uses unified memory, which you probably don't need. You could try putting all code that allocates and executes on the GPU within
Has there been any progress on multi-GPU training with Flux in the meantime? An MWE or tutorial would be nice, showing distributed training on multiple nodes with multiple GPUs each.
For Flux in particular, you'll probably want a combination of DhairyaLGandhi/DaggerFlux.jl for the user-facing API and JuliaGPU/DaggerGPU.jl pull request #17 ("Optimize CuArrayDeviceProc with IPC and DtoD", by jpsamaroo) for the GPU transport side of things. If you have access to a CUDA-aware MPI, something like AStupidBear/DistributedFlux.jl is also an option.
Thanks! These are really fresh! The fusion of Flux and Dagger is what I had hoped for, since it does the low-level work for me. I'll try it out with GPUs when I find the time.