Hi all,

I am trying to adapt my GPU code to use multiple GPUs, since a single GPU does not have enough memory. I found an example (https://www.juliabloggers.com/multiple-gpu-parallelism-on-the-hpc-with-julia/), and the basic idea is:

- split the whole data set into parts,
- store each part as a `CuArray` on a different GPU, and
- launch the kernels asynchronously on the different GPUs, each operating on its local data.

So I ran a test with the following code:

```julia
using CuArrays, CUDAnative
using BenchmarkTools

N = 2     # number of GPUs
n = 2^20  # number of elements per GPU

A = Vector{Any}(undef, N)
B = Vector{Any}(undef, N)
C = Vector{Any}(undef, N)

function kernel_vadd(out, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for j = 1:10000
        out[i] = CUDAnative.cos(a[i]) + CUDAnative.sin(b[i])
    end
    return nothing
end

# Allocate one chunk of data on each device.
for i in 1:N
    device!(i - 1)
    A[i] = CuArray(rand(n))
    B[i] = CuArray(rand(n))
    C[i] = CuArray(rand(n))
end

# Launch the kernel on one GPU only.
function test()
    @sync begin
        for i in 1:1
            @async begin
                device!(i - 1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end

# Launch the kernel on both GPUs.
function test2()
    @sync begin
        for i in 1:2
            @async begin
                device!(i - 1)
                @cuda threads=4 kernel_vadd(C[i], A[i], B[i])
            end
        end
    end
    return nothing
end

@btime test()
@btime test2()
```

which gives:

```
14.252 μs (64 allocations: 2.83 KiB)
25.586 μs (120 allocations: 5.28 KiB)
```

As we can see, the time for `test2()`, which uses two GPUs, is almost twice that of `test()`, which uses only one GPU. Ideally, `test2()` and `test()` would take about the same time, since the two kernels are launched asynchronously on different devices. How can I achieve this? Many thanks.