 # How to use multiple GPUs correctly?

Hi all,

I am trying to change my GPU code to use multiple GPUs due to the memory limit of a single GPU. I found an example (https://www.juliabloggers.com/multiple-gpu-parallelism-on-the-hpc-with-julia/) and the basic idea is

1. Split the whole data set into parts.
2. Store each part as CuArrays on a different GPU.
3. Launch the kernels asynchronously, each GPU working on its local data.

So I ran a test with the following code:

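``````
using CuArrays, CUDAnative
using BenchmarkTools

N = 2      # number of GPUs
n = 2^20   # elements per GPU
A = Vector{Any}(undef, N)
B = Vector{Any}(undef, N)
C = Vector{Any}(undef, N)

# The kernel's signature line is missing from the post; the name `kernel`
# and the argument order (a, b, out) are assumed from the body.
function kernel(a, b, out)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for j = 1:10000
        out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
    end
    return nothing
end

# Allocate each part of the data while its own device is active.
for i in 1:N
    device!(i-1)
    A[i] = CuArray(rand(n))
    B[i] = CuArray(rand(n))
    C[i] = CuArray(rand(n))
end

# Launch on the first GPU only.
function test()
    @sync begin
        for i in 1:1
            @async begin
                device!(i-1)
                # The launch line is missing from the post; this call and its
                # launch configuration are assumed.
                @cuda blocks=n÷256 threads=256 kernel(A[i], B[i], C[i])
            end
        end
    end
    return nothing
end

# Launch on both GPUs.
function test2()
    @sync begin
        for i in 1:2
            @async begin
                device!(i-1)
                # Assumed launch, as in test().
                @cuda blocks=n÷256 threads=256 kernel(A[i], B[i], C[i])
            end
        end
    end
    return nothing
end

@btime test()
@btime test2()
``````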

which gives

``````
14.252 μs (64 allocations: 2.83 KiB)
25.586 μs (120 allocations: 5.28 KiB)
``````

As we can see, test2(), which uses two GPUs, takes almost twice as long as test(), which uses only one. Ideally, test2() and test() would have similar running times. How can I achieve this? Many thanks.

You’re just timing the time to launch a kernel here, since `Base.@sync` doesn’t synchronize the GPU. Similarly, there’s no need for calling `Base.@async` since GPU operations are mostly asynchronous already. You’ll need to have another loop over the devices and `CUDAdrv.synchronize()` (or fetch back the results) to actually see a difference. See e.g. https://github.com/JuliaGPU/CUDAnative.jl/blob/v2.0.0/examples/multigpu.jl
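A minimal sketch of that two-loop pattern, not the linked example itself. It assumes two visible GPUs and the CuArrays/CUDAnative/CUDAdrv packages used in this thread; the names `kernel`, `data`, `THREADS` and `launch_and_wait` are made up for illustration, and the block size is an arbitrary choice.

```julia
using CuArrays, CUDAnative, CUDAdrv

# Element-wise kernel in the spirit of the one in the question (name assumed).
function kernel(a, b, out)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
    return nothing
end

const N = 2          # assumed number of GPUs
const n = 2^20       # elements per GPU
const THREADS = 256  # assumed block size; n is a multiple of it

# One (a, b, out) triple per device, allocated while that device is active.
data = map(1:N) do i
    device!(i-1)
    (CuArray(rand(n)), CuArray(rand(n)), CuArray(zeros(n)))
end

function launch_and_wait(ndev)
    # First loop: queue work on every device; each launch returns immediately.
    for i in 1:ndev
        device!(i-1)
        a, b, out = data[i]
        @cuda blocks=n÷THREADS threads=THREADS kernel(a, b, out)
    end
    # Second loop: block until each device has finished its queued work.
    for i in 1:ndev
        device!(i-1)
        CUDAdrv.synchronize()
    end
    return nothing
end
```

Benchmarking `launch_and_wait(1)` against `launch_and_wait(2)` with `@btime` should then measure kernel execution time rather than launch overhead. Launching on all devices before waiting on any of them is what lets the GPUs overlap.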

That said, single-process-multiple-gpu with Julia+CUDA isn’t a polished story today, so you might be better off using a single GPU per process and e.g. CUDA-aware MPI.

Many thanks for your answers and suggestions, I will try CUDA-aware MPI.