How to use multiple GPUs correctly?

Hi all,

I am trying to adapt my GPU code to use multiple GPUs, because of the memory limit of a single GPU. I found an example (Multiple-GPU Parallelism on the HPC with Julia | juliabloggers.com), and the basic idea is to:

  1. split the whole data into different parts,
  2. store each part of the data as a CuArray on a different GPU, and
  3. launch the kernels asynchronously on the different GPUs, each working on its local data.

So I ran a test with the following code:

using CuArrays, CUDAnative
using BenchmarkTools

N = 2       # number of GPUs
n = 2^20    # number of elements per GPU
A = Vector{Any}(undef, N)
B = Vector{Any}(undef, N)
C = Vector{Any}(undef, N)

function kernel_vadd(out, a, b)
  i = (blockIdx().x-1) * blockDim().x + threadIdx().x
  # repeat the computation so the kernel does a non-trivial amount of work
  for j = 1:10000
      out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
  end
  return nothing
end

# allocate one set of input/output arrays on each device
for i in 1:N
    device!(i-1)
    A[i] = CuArray(rand(n))
    B[i] = CuArray(rand(n))
    C[i] = CuArray(rand(n))
end

# launch the kernel on a single GPU
function test()
    @sync begin
        for i in 1:1
            @async begin
                device!(i-1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end

# launch the kernel on both GPUs
function test2()
    @sync begin
        for i in 1:2
            @async begin
                device!(i-1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end

@btime test()
@btime test2()

which gives

  14.252 μs (64 allocations: 2.83 KiB)
  25.586 μs (120 allocations: 5.28 KiB)

As we can see, test2(), which uses two GPUs, takes almost twice as long as test(), which uses only one GPU. Ideally, I would like test2() and test() to have similar running times. How can I achieve this? Many thanks.


You're only timing the kernel launch here, since Base.@sync doesn't synchronize the GPU. Similarly, there's no need to call Base.@async, since GPU operations are mostly asynchronous already. You'll need another loop over the devices that calls CUDAdrv.synchronize() (or fetches back the results) to actually see a difference. See e.g. https://github.com/JuliaGPU/CUDAnative.jl/blob/v2.0.0/examples/multigpu.jl
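In case it helps, a minimal sketch of that fix for the code above (assuming the same N, A, B, C and kernel_vadd; test3 is just an illustrative name) could look like this: launch on every device first, then loop over the devices again and wait for each one to finish.

using CUDAdrv

function test3()
    # launch phase: kernels are queued asynchronously on each device
    for i in 1:N
        device!(i-1)
        @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
    end
    # synchronization phase: wait for every device to finish before returning
    for i in 1:N
        device!(i-1)
        CUDAdrv.synchronize()
    end
    return nothing
end

With the synchronization included, the timings should reflect the actual kernel execution time on each GPU rather than just the launch overhead.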

That said, single-process, multiple-GPU use with Julia + CUDA isn't a polished story today, so you might be better off using a single GPU per process and e.g. CUDA-aware MPI.
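For reference, the one-GPU-per-process pattern with MPI.jl might look roughly like the sketch below (hypothetical setup; it assumes MPI.jl is installed and that one MPI rank is started per GPU on the node):

using MPI, CuArrays, CUDAnative

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# bind this process to "its" GPU; MPI ranks and device ordinals are both 0-based
device!(rank)

# from here on, each rank runs an ordinary single-GPU code path on its own chunk of the data
a = CuArray(rand(2^20))
# ... launch kernels, exchange results via (CUDA-aware) MPI as needed ...

MPI.Finalize()

Each process then only ever talks to one device, which sidesteps the single-process multi-GPU rough edges entirely.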


Many thanks for your answers and suggestions; I will try CUDA-aware MPI.