Has anyone ever used multiple GPUs with Flux? If so, did you have to modify your code to run on multiple GPUs?
Yes, I use Flux with multiple GPUs, but my approach is not elegant.

In CUDAnative (commit f60c4754356225151c866da0ca512b434aa03abd), in src/CUDAnative.jl on line 75, I replaced

```julia
default_device[] = CuDevice(0)
```

with

```julia
dev = haskey(ENV, "CUDADEV") ? parse(ENV["CUDADEV"]) : 0
default_device[] = CuDevice(dev)
```

after which I can (in bash) run

```bash
export CUDADEV=1 && julia
```

to open a Julia session with CUDAnative using GPU 1.
You’ll be able to do `device!(::CuDevice)` on Julia 0.7 (technically, on any recent CUDAdrv/CUDAnative, but those are Julia 0.7 only). Do note that CuArray allocations are currently tied to a device; for better usability we should probably move to unified memory.
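For example (a minimal sketch; treat the exact import set as an assumption about the CUDAdrv/CUDAnative versions of that era):

```julia
using CUDAdrv, CUDAnative

# Make GPU 1 the active device; subsequent allocations and
# kernel launches will target it.
device!(CuDevice(1))
```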
Are there any live examples?
Only https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/multigpu.jl, but that uses unified memory, which you probably don’t need. You could try to put all code that allocates and executes on the GPU within `device!` blocks, as sketched below.
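Roughly like this (a sketch, assuming CuArrays for the array type and that `devices()` is exported by CUDAdrv):

```julia
using CUDAdrv, CUDAnative, CuArrays

results = Float32[]
for dev in devices()
    device!(dev)                      # switch the active device
    x = CuArray(rand(Float32, 1024))  # allocated on `dev`
    push!(results, sum(x .^ 2))       # reduction runs on `dev`; the scalar comes back to the host
end
```

Since allocations are tied to a device, be careful to only touch each CuArray while its device is active.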
Has there been some progress w.r.t. multi-GPU training with Flux in the meantime? An MWE or tutorial would be nice, showing distributed training on multiple nodes with multiple GPUs each.
For Flux in particular, you’ll probably want a combination of https://github.com/DhairyaLGandhi/DaggerFlux.jl for the user-facing API and https://github.com/JuliaGPU/DaggerGPU.jl/pull/17 for the GPU transport side of things. If you have access to a CUDA-aware MPI, something like https://github.com/AStupidBear/DistributedFlux.jl is also an option.
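For the CUDA-aware MPI route, the underlying pattern is plain data parallelism: each rank drives one GPU, computes gradients on its own shard of the batch, and the ranks average their gradients before every optimizer step. A rough sketch (hypothetical code, not DistributedFlux.jl’s actual API; assumes one GPU per rank and an MPI.jl built against a CUDA-aware MPI):

```julia
using MPI, Flux, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
nranks = MPI.Comm_size(comm)

# Bind this rank to a GPU (assumes ranks per node <= GPUs per node).
CUDA.device!(MPI.Comm_rank(comm) % length(CUDA.devices()))

model = Dense(10, 1) |> gpu
ps = Flux.params(model)
opt = Descent(0.01)

# Each rank trains on its own shard of the data (random here for brevity).
x = gpu(rand(Float32, 10, 32))
y = gpu(rand(Float32, 1, 32))

gs = gradient(() -> Flux.Losses.mse(model(x), y), ps)
for p in ps
    g = gs[p]
    g === nothing && continue
    MPI.Allreduce!(g, +, comm)  # CUDA-aware MPI can reduce GPU buffers directly
    g ./= nranks                # sum -> mean
end
Flux.Optimise.update!(opt, ps, gs)
```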
Thanks! These are really fresh! The fusion of Flux and Dagger was what I had hoped for, doing the low-level work for me. I’ll try it out with GPUs when I find the time.