Hi,
I have a question about using the Mem.pin() function in concurrent GPU programming. As I understand it, both the GPU and the CPU block while data is being copied, so the copy cannot overlap with other asynchronous operations. However, the blocking on the CPU side can be avoided by pinning the host memory with Mem.pin().
So I wrote the following two pieces of code: the first uses Mem.pin(), and the second does not.
What confuses me is that using Mem.pin() does not make the code run faster; it is actually slightly slower than the version without Mem.pin().
Thanks for taking the time to read my questions!
using CUDA
using BenchmarkTools
n1 = 10000
n2 = 1200
Agpu = CUDA.rand(Float64,n1,n1)
Bgpu = CUDA.rand(Float64,n1)
Cgpu = CUDA.zeros(Float64,n1)
Dcpu = rand(Float64,n2,n2)
Dgpu = CUDA.zeros(Float64,n2,n2)
Ecpu = zeros(Float64,n2,n2)
Egpu = CUDA.rand(Float64,n2,n2)
# pin (page-lock) the host memory of Ecpu
Mem.pin(Ecpu)
# use Mem.pin() function
@btime @sync begin
    @async begin
        Cgpu .= Agpu*Bgpu
        synchronize()
    end
    @async begin
        Egpu .= CuArray(Ecpu)
    end
    synchronize()
end # time: 2.780ms
# without using Mem.pin() function
@btime @sync begin
    @async begin
        Cgpu .= Agpu*Bgpu
        synchronize()
    end
    @async begin
        Dgpu .= CuArray(Dcpu)
    end
    synchronize()
end # time: 2.768ms
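
For comparison, here is a minimal sketch that times only the host-to-device copy, with and without pinning, so that the matrix-vector product does not dominate the measurement. It is an illustration under assumptions, not a verified explanation of the timings above: the names `host_pinned`, `host_pageable` and `dev` are hypothetical, it keeps the `Mem.pin` call used in this post (newer CUDA.jl releases may expose it under a different name, such as `CUDA.pin`), and it copies with `copyto!` into an existing device array instead of building a temporary with `CuArray(...)`.

```julia
using CUDA
using BenchmarkTools

n = 1200
host_pinned   = zeros(Float64, n, n)   # hypothetical buffer that will be pinned
host_pageable = zeros(Float64, n, n)   # hypothetical buffer left as ordinary memory
dev = CUDA.zeros(Float64, n, n)        # preallocated destination on the GPU

# Page-lock the first host buffer (same call as in the code above).
Mem.pin(host_pinned)

# copyto! writes directly into the existing device array, so no temporary
# CuArray is allocated inside the timed region. CUDA.@sync waits for the GPU
# work to finish, and `$` interpolation keeps global-variable overhead out
# of the measurement.
@btime CUDA.@sync copyto!($dev, $host_pinned)
@btime CUDA.@sync copyto!($dev, $host_pageable)
```

If the two timings differ here but not in the benchmarks above, that would suggest the extra allocation in `CuArray(...)` and the cost of `Agpu*Bgpu` are hiding the effect of pinning; I have not confirmed this.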