Pmap with multiple GPUs

Hi,
First, CUDA.jl is simply amazing and really easy to use, kudos to you all!
Now, in my problem I have multiple GPUs and I would like to parallelize the training of my networks, i.e. have one model training per GPU.

I saw that there is an explanation of how to do this here: https://juliagpu.gitlab.io/CUDA.jl/usage/multigpu/#Scenario-1:-One-GPU-per-process
Without GPUs I would just use pmap from Distributed and call a general wrapper training function.
But I have no idea how to combine the approach of the docs with pmap.

Thanks!

Just set up one GPU per process and call pmap. There’s not much more to it.
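For example, something along these lines (train_model and configs are hypothetical placeholders for your own wrapper function and its inputs; the device assignment follows the pattern from the multi-GPU docs):

```julia
using Distributed, CUDA

# One worker per GPU.
addprocs(length(devices()))
@everywhere using CUDA

# Pin each worker to its own device (pattern from the docs).
asyncmap(zip(workers(), devices())) do (p, d)
    remotecall_wait(p) do
        @info "Worker $p uses $d"
        device!(d)
    end
end

# Placeholder for your wrapper training function; each call runs
# on whichever worker pmap picks, using that worker's GPU.
@everywhere function train_model(config)
    # ... build the model and train it on the current device ...
end

results = pmap(train_model, configs)
```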

Oh ok, I thought that the code to run needed to be in the remotecall_wait function.
Now I face another issue. When running the code from the docs, I get:

[ Info: Worker 2 uses CuDevice(0)
ERROR: LoadError: On worker 2:
UndefRefError: access to undefined reference
getindex at ./array.jl:809 [inlined]
context at /home/theo/.julia/packages/CUDA/dZvbp/src/state.jl:242 [inlined]
device! at /home/theo/.julia/packages/CUDA/dZvbp/src/state.jl:286
device! at /home/theo/.julia/packages/CUDA/dZvbp/src/state.jl:265 [inlined]
#32 at /home/theo/experiments/ParticleFlow/julia/scripts/run_swag.jl:17
#110 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:309
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356
Stacktrace:
 [1] (::Base.var"#770#772")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#770#772", ::Array{Any,1}) at ./abstractarray.jl:2009
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.Zip{Tuple{Array{Int64,1},CUDA.DeviceSet}}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice at ./asyncmap.jl:154 [inlined]
 [5] async_usemap(::var"#31#33", ::Base.Iterators.Zip{Tuple{Array{Int64,1},CUDA.DeviceSet}}; ntasks::Int64, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#754 at ./asyncmap.jl:81 [inlined]
 [7] asyncmap(::Function, ::Base.Iterators.Zip{Tuple{Array{Int64,1},CUDA.DeviceSet}}) at ./asyncmap.jl:8

Note that my machine only has one GPU, but I face the same error on the cluster with 8 GPUs.

How did you set the context?

I suppose you mean:

using Distributed, CUDA
addprocs(length(devices()))
@everywhere using CUDA

Ok, adding context() in the loop removes the error:

asyncmap((zip(workers(), devices()))) do (p, d)
    remotecall_wait(p) do
        @info "Worker $p uses $d"
        context()
        device!(d)
    end
end

Is that the solution?

Glad you found a workaround. This is a bug though, please file an issue :slightly_smiling_face:

Ah no, sorry, I tried it now on the cluster and got the following error:

[ Info: Worker 3 uses CuDevice(1)
[ Info: Worker 2 uses CuDevice(0)
[ Info: Worker 4 uses CuDevice(2)
[ Info: Worker 7 uses CuDevice(5)
[ Info: Worker 8 uses CuDevice(6)
[ Info: Worker 9 uses CuDevice(7)
[ Info: Worker 6 uses CuDevice(4)
[ Info: Worker 5 uses CuDevice(3)
ERROR: LoadError: On worker 2:
CUDA error: out of memory (code 2, ERROR_OUT_OF_MEMORY)
throw_api_error at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:103
macro expansion at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:110 [inlined]
cuDevicePrimaryCtxRetain at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/utils/call.jl:93
CuContext at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/cudadrv/context/primary.jl:31 [inlined]
context at /home/ubuntu/.julia/packages/CUDA/dZvbp/src/state.jl:249 [inlined]
device! at /home/ubuntu/.julia/packages/CUDA/dZvbp/src/state.jl:286
device! at /home/ubuntu/.julia/packages/CUDA/dZvbp/src/state.jl:265 [inlined]
#10 at /home/ubuntu/ParticleFlow_Exp/julia/scripts/run_swag.jl:18
#110 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:309
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356

Using context() outside of the loop did not help either.

I can’t reproduce the pmap issue, but let’s keep that discussion in the issue.

For the other error: you’re running out of GPU memory, nothing we can do about that. Same for the CUBLAS initialization error you reported on Slack, that probably happens because of high memory pressure (https://github.com/JuliaGPU/CUDA.jl/issues/340).

I did manage to reproduce the original issue, fixed here: https://github.com/JuliaGPU/CUDA.jl/pull/471. This only occurs when passing a CuDevice to a new process, so doesn’t have any other impact. As a workaround, calling context() as you did is sufficient.
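Since the bug is only triggered by serializing a CuDevice to a new process, another possible workaround (a sketch, not something from the docs) is to send the integer device ordinal instead and reconstruct the device on the worker:

```julia
using Distributed, CUDA

# Send a plain Int ordinal so no CuDevice object is serialized to the
# worker; the worker reconstructs the device locally with CuDevice(i).
asyncmap(zip(workers(), 0:length(devices())-1)) do (p, i)
    remotecall_wait(p) do
        device!(CuDevice(i))
        @info "Worker $p uses device $i"
    end
end
```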