Hi all,
I am trying to adapt my GPU code to use multiple GPUs because of the memory limit of a single GPU. I found an example (Multiple-GPU Parallelism on the HPC with Julia | juliabloggers.com), and the basic idea is to:
- split the whole data into different parts,
- store each part of the data as a CuArray on a different GPU,
- launch the kernels asynchronously on the different GPUs, each working on its local data (a rough sketch of these steps follows the list).
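For concreteness, here is a rough sketch of what I mean by the first two steps; the chunking scheme and the variable names (ngpu, data, chunks, etc.) are just illustrative, not my actual code:

using CuArrays, CUDAnative

ngpu = 2
data = rand(2^20)                        # the whole data set on the host
len  = cld(length(data), ngpu)           # elements per GPU (ceiling division)
chunks = Vector{Any}(undef, ngpu)
for i in 1:ngpu
    device!(i-1)                         # make GPU i the current device
    lo = (i-1)*len + 1
    hi = min(i*len, length(data))
    chunks[i] = CuArray(data[lo:hi])     # upload this part of the data to GPU i
end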
So I did a test with the following code:
using CuArrays, CUDAnative
using BenchmarkTools
N = 2                         # number of GPUs
n = 2^20                      # number of elements per GPU
A = Vector{Any}(undef, N)     # per-GPU input arrays
B = Vector{Any}(undef, N)
C = Vector{Any}(undef, N)     # per-GPU output arrays
function kernel_vadd(out, a, b)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    # artificial workload: repeatedly overwrite out[i] so the kernel does a
    # measurable amount of work
    for j = 1:10000
        out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
    end
    return nothing
end
for i in 1:N
    device!(i-1)              # switch to GPU i
    A[i] = CuArray(rand(n))   # allocate this GPU's part of the data
    B[i] = CuArray(rand(n))
    C[i] = CuArray(rand(n))
end
function test()               # launch the kernel on a single GPU
    @sync begin
        for i in 1:1
            @async begin
                device!(i-1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end
function test2()              # launch the kernel asynchronously on both GPUs
    @sync begin
        for i in 1:2
            @async begin
                device!(i-1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end
@btime test()
@btime test2()
which gives
14.252 μs (64 allocations: 2.83 KiB)
25.586 μs (120 allocations: 5.28 KiB)
As you can see, the time for test2(), which uses two GPUs, is almost twice that of test(), which uses only one GPU. Ideally, since the kernels are launched asynchronously on different devices, I would hope that test2() and test() have similar running times. How can I achieve this? Many thanks.