Pmap with multiple GPUs

Hi,
First, CUDA.jl is simply amazing and really easy to use, kudos to you all!
Now, in my problem I have multiple GPUs and I would like to train my networks in parallel, basically one model training per GPU.

I saw that there is, more or less, an explanation of how to do this here: https://juliagpu.gitlab.io/CUDA.jl/usage/multigpu/#Scenario-1:-One-GPU-per-process
Without GPUs I would just use pmap from Distributed and call a general wrapper training function (sketched below). But I have no idea how to combine the approach from the docs with pmap.
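Roughly, my CPU-only version looks like this (train_model and configs are just placeholders for my real training code):

using Distributed
addprocs(8)

@everywhere function train_model(config)
    # placeholder: build and train one network for this configuration
    return config
end

configs = 1:16  # placeholder list of hyperparameter settings
results = pmap(train_model, configs)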

Thanks!

Just set up one GPU per process and call pmap. There’s not much more to it.
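In sketch form, assuming one worker per device and a train_model wrapper like yours (both placeholders here):

using Distributed, CUDA
addprocs(length(devices()))
@everywhere using CUDA

# give each worker its own GPU, as in the multi-GPU docs
asyncmap(zip(workers(), devices())) do (p, d)
    remotecall_wait(p) do
        device!(d)
    end
end

@everywhere function train_model(config)
    # placeholder: GPU allocations made here land on this worker's device
    return config
end

pmap(train_model, 1:16)  # one independent training per call, spread over the workers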


Oh OK, I thought that the code to run needed to be in the remotecall_wait function.
Now I face another issue. When running the code from the docs, I get:

[ Info: Worker 2 uses CuDevice(0)
ERROR: LoadError: On worker 2:
UndefRefError: access to undefined reference
getindex at ./array.jl:809 [inlined]
context at /home/theo/.julia/packages/CUDA/dZvbp/src/state.jl:242 [inlined]
device! at /home/theo/.julia/packages/CUDA/dZvbp/src/state.jl:286
device! at /home/theo/.julia/packages/CUDA/dZvbp/src/state.jl:265 [inlined]
#32 at /home/theo/experiments/ParticleFlow/julia/scripts/run_swag.jl:17
#110 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:309
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356
Stacktrace:
 [1] (::Base.var"#770#772")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#770#772", ::Array{Any,1}) at ./abstractarray.jl:2009
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.Zip{Tuple{Array{Int64,1},CUDA.DeviceSet}}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice at ./asyncmap.jl:154 [inlined]
 [5] async_usemap(::var"#31#33", ::Base.Iterators.Zip{Tuple{Array{Int64,1},CUDA.DeviceSet}}; ntasks::Int64, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#754 at ./asyncmap.jl:81 [inlined]
 [7] asyncmap(::Function, ::Base.Iterators.Zip{Tuple{Array{Int64,1},CUDA.DeviceSet}}) at ./asyncmap.jl:8

Note that my machine only has one GPU, but I get the same error on the cluster with 8 GPUs.

How did you set the context?

I suppose you mean:

using Distributed, CUDA
addprocs(length(devices()))  # one worker process per GPU
@everywhere using CUDA       # load CUDA.jl on every worker

OK, adding context() in the loop removes the error:

asyncmap(zip(workers(), devices())) do (p, d)
    remotecall_wait(p) do
        @info "Worker $p uses $d"
        context()  # initialize the context before switching devices
        device!(d)
    end
end

Is that the solution?


Glad you found a workaround. This is a bug though, please file an issue 🙂

Ah no, sorry, I tried it now on the cluster and got the following error:

[ Info: Worker 3 uses CuDevice(1)
[ Info: Worker 2 uses CuDevice(0)
[ Info: Worker 4 uses CuDevice(2)
[ Info: Worker 7 uses CuDevice(5)
[ Info: Worker 8 uses CuDevice(6)
[ Info: Worker 9 uses CuDevice(7)
[ Info: Worker 6 uses CuDevice(4)
[ Info: Worker 5 uses CuDevice(3)
ERROR: LoadError: On worker 2:
CUDA error: out of memory (code 2, ERROR_OUT_OF_MEMORY)
throw_api_error at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:103
macro expansion at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:110 [inlined]
cuDevicePrimaryCtxRetain at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/utils/call.jl:93
CuContext at /home/ubuntu/.julia/packages/CUDA/dZvbp/lib/cudadrv/context/primary.jl:31 [inlined]
context at /home/ubuntu/.julia/packages/CUDA/dZvbp/src/state.jl:249 [inlined]
device! at /home/ubuntu/.julia/packages/CUDA/dZvbp/src/state.jl:286
device! at /home/ubuntu/.julia/packages/CUDA/dZvbp/src/state.jl:265 [inlined]
#10 at /home/ubuntu/ParticleFlow_Exp/julia/scripts/run_swag.jl:18
#110 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:309
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356

Using context() outside of the loop did not help either

I can’t reproduce the pmap issue, but let’s keep that discussion in the issue.

For the other error: you’re running out of GPU memory, nothing we can do about that. Same for the CUBLAS initialization error you reported on Slack, that probably happens because of high memory pressure (https://github.com/JuliaGPU/CUDA.jl/issues/340).

I did manage to reproduce the original issue, fixed here: https://github.com/JuliaGPU/CUDA.jl/pull/471. This only occurs when passing a CuDevice to a new process, so doesn’t have any other impact. As a workaround, calling context() as you did is sufficient.
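So until that fix is in a release, the end-to-end pattern would look roughly like this (train_model and the 1:16 range of configurations are placeholders):

using Distributed, CUDA
addprocs(length(devices()))
@everywhere using CUDA

# one GPU per worker; context() works around the bug with passing a CuDevice to a new process
asyncmap(zip(workers(), devices())) do (p, d)
    remotecall_wait(p) do
        context()
        device!(d)
    end
end

@everywhere train_model(config) = config  # placeholder training function

pmap(train_model, 1:16)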