Why does the execution time of overlapping GPU and CPU computations not get faster after using the Mem.pin() function?

Hi,

I have a question about using the Mem.pin() function in concurrent GPU programming. As I understand it, copying data between the CPU and the GPU blocks both sides, so the copy cannot overlap with other work. However, pinning the host memory with Mem.pin() should avoid the blocking on the CPU side.

So I wrote the following two pieces of code: the first uses Mem.pin(), and the second does not.

  • What confuses me is that using Mem.pin() does not make the code run faster; it is actually slightly slower than the version without Mem.pin().

Thanks for taking the time to read my questions!

using CUDA
using BenchmarkTools

n1 = 10000
n2 = 1200

Agpu =  CUDA.rand(Float64,n1,n1)
Bgpu =  CUDA.rand(Float64,n1)
Cgpu =  CUDA.zeros(Float64,n1)

Dcpu =  rand(Float64,n2,n2)
Dgpu =  CUDA.zeros(Float64,n2,n2)

Ecpu = zeros(Float64,n2,n2)
Egpu =  CUDA.rand(Float64,n2,n2)
# Pin (page-lock) the host memory
Mem.pin(Ecpu)

# use Mem.pin() function
@btime @sync begin
    @async begin
        Cgpu .= Agpu*Bgpu 
        synchronize()
    end
    @async begin 
        Egpu .= CuArray(Ecpu) 
    end
    synchronize()
end  # time: 2.780ms

# without using Mem.pin() function
@btime @sync begin
    @async begin
        Cgpu .= Agpu*Bgpu 
        synchronize()
    end
    @async begin 
        Dgpu .= CuArray(Dcpu)
    end
    synchronize()
end # time: 2.768ms

CuArray(Ecpu) is still a synchronous memory operation. You need to use an asynchronous copy instead.
Try copyto!(Egpu, Ecpu).
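
A minimal sketch of that suggestion, assuming the same array names as in the question and a CUDA.jl version where pinning is exposed as Mem.pin (newer releases may call it CUDA.pin instead). It copies straight into the existing device array rather than allocating a temporary with CuArray(Ecpu):

```julia
using CUDA

n2 = 1200
Ecpu = rand(Float64, n2, n2)         # host source buffer
Egpu = CUDA.zeros(Float64, n2, n2)   # preallocated device destination

# Page-lock the host buffer so the driver can transfer from it asynchronously.
# Assumption: Mem.pin is available, as in the question's CUDA.jl version.
Mem.pin(Ecpu)

# Copy into the existing device array; no temporary CuArray is allocated.
copyto!(Egpu, Ecpu)

# Wait for outstanding work on this task's stream before relying on the data.
synchronize()
```

Whether the copy actually overlaps with the matrix-vector product still depends on the driver, so timing alone may not show a difference.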

There is a difference, namely whether we synchronize before doing the copy: CUDA.jl/array.jl at 594a8b68aabff9ef497c83ebf001138c257dddd5 · JuliaGPU/CUDA.jl · GitHub
But yes, the operation still does a blocking synchronization afterwards.

Actually, I was looking at the wrong function. When copying to the GPU, we don’t need to synchronize afterwards, so using pinned memory should make the operation entirely non-blocking: CUDA.jl/array.jl at 594a8b68aabff9ef497c83ebf001138c257dddd5 · JuliaGPU/CUDA.jl · GitHub. Of course, CUDA may still decide to behave otherwise, so it’s really recommended here to use Nsight Systems and have a look at the API call timeline. Memory copies from pinned vs unpinned memory should be colored differently in that timeline.
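
To look at that timeline, something like the following could be used. This is only a sketch: it reuses the array names from the question, assumes Mem.pin is available in the installed CUDA.jl version, and wraps the region of interest in CUDA.@profile. How the macro behaves depends on the CUDA.jl version, so check the profiling section of the CUDA.jl documentation for your release.

```julia
using CUDA

n1, n2 = 10000, 1200
Agpu = CUDA.rand(Float64, n1, n1)
Bgpu = CUDA.rand(Float64, n1)
Cgpu = CUDA.zeros(Float64, n1)
Ecpu = rand(Float64, n2, n2)
Egpu = CUDA.zeros(Float64, n2, n2)
Mem.pin(Ecpu)   # assumption: newer CUDA.jl releases may expose this as CUDA.pin

# Warm up once so that compilation does not show up in the trace.
Cgpu .= Agpu * Bgpu
copyto!(Egpu, Ecpu)
synchronize()

# To capture this in Nsight Systems, start Julia under the profiler
# (for example `nsys launch julia script.jl`); recent CUDA.jl versions
# may additionally need `CUDA.@profile external=true`.
CUDA.@profile begin
    @sync begin
        @async begin
            Cgpu .= Agpu * Bgpu      # compute on this task's stream
            synchronize()
        end
        @async begin
            copyto!(Egpu, Ecpu)      # host-to-device copy on another task's stream
            synchronize()
        end
    end
end
```

In the resulting timeline, check whether the host-to-device copy runs alongside the matrix-vector kernel or only after it.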
