How to use multiple GPUs correctly?

ww1g11 · October 16, 2019, 2:36pm

Hi all,

I am trying to change my GPU code to use multiple GPUs because of the memory limit of a single GPU. I found an example (Multiple-GPU Parallelism on the HPC with Julia | juliabloggers.com), and the basic idea is:

1. Split the whole data set into parts.
2. Store each part as a CuArray on a different GPU.
3. Launch the kernels asynchronously on the different GPUs, each with its local data.

So I ran a test with the following code:

using CuArrays, CUDAnative
using BenchmarkTools

N = 2
n = 2^20
A = Vector{Any}(undef, N)
B = Vector{Any}(undef, N)
C = Vector{Any}(undef, N)

function kernel_vadd(out, a, b)
  i = (blockIdx().x-1) * blockDim().x + threadIdx().x
  for j = 1:10000
      out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
  end
  return nothing
end

for i in 1:N
    device!(i-1)
    A[i] = CuArray(rand(n))
    B[i] = CuArray(rand(n))
    C[i] = CuArray(rand(n))
end

function test()
    @sync begin
        for i in 1:1
            @async begin
                device!(i-1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end

function test2()
    @sync begin
        for i in 1:2
            @async begin
                device!(i-1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end

@btime test()
@btime test2()

which gives

  14.252 μs (64 allocations: 2.83 KiB)
  25.586 μs (120 allocations: 5.28 KiB)

As we can see, test2(), which uses two GPUs, takes almost twice as long as test(), which uses only one. Ideally, I would like test2() and test() to have similar running times. How can I achieve this? Many thanks.

maleadt · October 16, 2019, 2:56pm

You’re only measuring the time it takes to launch a kernel here, since Base.@sync doesn’t synchronize the GPU. Similarly, there’s no need to call Base.@async, since GPU operations are mostly asynchronous already. You’ll need another loop over the devices that calls CUDAdrv.synchronize() (or fetches back the results) to actually see a difference. See e.g. https://github.com/JuliaGPU/CUDAnative.jl/blob/v2.0.0/examples/multigpu.jl
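
For readers following along, the advice above might look roughly like the sketch below: one loop that launches a kernel on every device, then a second loop that waits for each device. This is not the linked example. The names kernel_vadd_checked and launch_and_wait are made up for illustration, and the bounds check and the threads=256 launch configuration are assumptions of mine rather than part of the original post. It needs at least one CUDA GPU and the 2019-era CuArrays, CUDAnative and CUDAdrv packages.

```julia
using CuArrays, CUDAnative, CUDAdrv

# Same arithmetic as kernel_vadd in the question, plus a bounds check
# (an addition for this sketch) so the grid may be larger than the array.
function kernel_vadd_checked(out, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        for j = 1:10000
            @inbounds out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
        end
    end
    return nothing
end

const n = 2^20
const ndev = length(devices())      # use however many GPUs are visible

A = Vector{Any}(undef, ndev)
B = Vector{Any}(undef, ndev)
C = Vector{Any}(undef, ndev)

# Allocate each part of the data on its own device, as in the question.
for (i, dev) in enumerate(devices())
    device!(dev)
    A[i] = CuArray(rand(n))
    B[i] = CuArray(rand(n))
    C[i] = CuArray(rand(n))
end

function launch_and_wait()
    # First loop: kernel launches return immediately, so every device
    # gets its work queued before anything is waited on.
    for (i, dev) in enumerate(devices())
        device!(dev)
        @cuda threads=256 blocks=cld(n, 256) kernel_vadd_checked(C[i], A[i], B[i])
    end
    # Second loop: block until each device has finished its kernel.
    for dev in devices()
        device!(dev)
        CUDAdrv.synchronize()
    end
    return nothing
end

launch_and_wait()
```

Timing launch_and_wait() with @btime would then include the kernel execution itself rather than only the launch overhead, which is the quantity the question was trying to compare.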

That said, single-process-multiple-gpu with Julia+CUDA isn’t a polished story today, so you might be better off using a single GPU per process and e.g. CUDA-aware MPI.
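
A rough sketch of the one-GPU-per-process layout suggested above, assuming MPI.jl is installed and that all ranks run on a single node, so the global rank can be used to pick a device. It only shows device selection and a host-side reduction. It does not pass GPU buffers to MPI, so it does not exercise the CUDA-aware part. The variable names (dev_index, n_local and so on) are illustrative only.

```julia
using MPI
using CuArrays, CUDAnative, CUDAdrv

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Assumption: single node, so rank modulo the device count picks a GPU.
ndev = length(devices())
dev_index = rank % ndev
device!(dev_index)

# Each rank holds only its own part of the data, on its own GPU.
n_local = 2^20
a = CuArray(rand(n_local))
b = CuArray(rand(n_local))
c = cos.(a) .+ sin.(b)          # broadcast executes on this rank's GPU

# Reduce on the device, then combine the per-rank scalars on the host.
local_sum = sum(c)
total = MPI.Allreduce(local_sum, MPI.SUM, comm)

rank == 0 && println("sum over all ranks: ", total)
MPI.Finalize()
```

Such a script would be started with something like mpiexec -n 2 julia script.jl, where script.jl is a placeholder file name.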

ww1g11 · October 16, 2019, 3:12pm

Many thanks for your answers and suggestions. I will try CUDA-aware MPI.
