Why does the execution time of overlapping GPU and CPU computations not get faster after using the Mem.pin() function?

fred_wu · May 3, 2023, 2:17pm

Hi,

I have a question about using the Mem.pin() function in concurrent GPU programming. As I understand it, copying data between the CPU and the GPU blocks, so the copy cannot overlap with other work. However, pinning the host memory with Mem.pin() should avoid blocking the CPU.

So I wrote the following two pieces of code: the first uses Mem.pin(), and the second does not.

What confuses me is that using Mem.pin() does not make the code run faster; it is actually slightly slower than the version without Mem.pin().

Thanks for taking the time to read my questions!

using CUDA
using BenchmarkTools

n1 = 10000
n2 = 1200

Agpu =  CUDA.rand(Float64,n1,n1)
Bgpu =  CUDA.rand(Float64,n1)
Cgpu =  CUDA.zeros(Float64,n1)

Dcpu =  rand(Float64,n2,n2)
Dgpu =  CUDA.zeros(Float64,n2,n2)

Ecpu = zeros(Float64,n2,n2)
Egpu =  CUDA.rand(Float64,n2,n2)
# pin (page-lock) the host memory of Ecpu
Mem.pin(Ecpu)

# use Mem.pin() function
@btime @sync begin
    @async begin
        Cgpu .= Agpu*Bgpu 
        synchronize()
    end
    @async begin 
        Egpu .= CuArray(Ecpu) 
    end
    synchronize()
end  # time: 2.780ms

# without using Mem.pin() function
@btime @sync begin
    @async begin
        Cgpu .= Agpu*Bgpu 
        synchronize()
    end
    @async begin 
        Dgpu .= CuArray(Dcpu)
    end
    synchronize()
end # time: 2.768ms

vchuravy · May 4, 2023, 3:07am

CuArray(Ecpu) is still a synchronous memory operation. You need to use an asynchronous copy.
Try copyto!(Egpu, Ecpu).
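
A minimal sketch of that suggestion, reusing the names and sizes from the original post. Two things here are assumptions rather than part of the thread: Ecpu is filled with random values (the post used zeros) so that the copy can be checked, and Mem.pin is spelled as in the CUDA.jl version used in the question, which newer releases may expose under a different name. Each @async task in CUDA.jl runs on its own stream, so the copy and the matrix-vector product are at least allowed to overlap; whether they actually do depends on the driver and hardware.

```julia
using CUDA

n1, n2 = 10_000, 1_200

Agpu = CUDA.rand(Float64, n1, n1)
Bgpu = CUDA.rand(Float64, n1)
Cgpu = CUDA.zeros(Float64, n1)

# Assumption: random host data instead of zeros, so the result can be verified.
Ecpu = rand(Float64, n2, n2)
Egpu = CUDA.zeros(Float64, n2, n2)

# Page-lock the host buffer (API as used in the original post).
Mem.pin(Ecpu)

@sync begin
    @async begin
        Cgpu .= Agpu * Bgpu      # GPU compute on this task's stream
        synchronize()            # wait for this task's stream only
    end
    @async begin
        copyto!(Egpu, Ecpu)      # copy into the preallocated device array
        synchronize()            # make sure the copy has finished
    end
end

# Sanity check: the device array now holds the host data.
@assert Array(Egpu) == Ecpu
```

Unlike Egpu .= CuArray(Ecpu), this form does not allocate a temporary device array on every iteration; it writes directly into Egpu.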

maleadt · May 4, 2023, 6:01am

There is a difference, namely whether we synchronize before doing the copy: CUDA.jl/array.jl at 594a8b68aabff9ef497c83ebf001138c257dddd5 · JuliaGPU/CUDA.jl · GitHub
But yes, the operation still does a blocking synchronization afterwards.

maleadt · May 5, 2023, 9:24am

Actually, I was looking at the wrong function. When copying to the GPU, we don’t need to synchronize afterwards, so using pinned memory should make the operation entirely non-blocking: CUDA.jl/array.jl at 594a8b68aabff9ef497c83ebf001138c257dddd5 · JuliaGPU/CUDA.jl · GitHub. Of course, CUDA may still decide to behave otherwise, so I really recommend using Nsight Systems to look at the API call timeline. Memory copies from pinned vs. unpinned memory should be colored differently.
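
As a rough first check before opening a profiler, one can time how long the copyto! call itself keeps the host busy, with and without pinning. This is only a sketch under assumptions: the helper host_time and the array names pageable, pinned, and dst are hypothetical, Mem.pin is spelled as in the original post, and wall-clock timing of a single call is noisy. It is not a substitute for the Nsight Systems timeline recommended above.

```julia
using CUDA

n2 = 1_200
pageable = rand(Float64, n2, n2)   # ordinary host memory
pinned   = rand(Float64, n2, n2)   # will be page-locked
Mem.pin(pinned)                    # API as used in the original post
dst = CUDA.zeros(Float64, n2, n2)

# Hypothetical helper: host-side time until copyto! returns,
# measured without waiting for the GPU inside the timed region.
function host_time(dst, src)
    synchronize()                   # start from an idle stream
    t = @elapsed copyto!(dst, src)  # time until the call returns
    synchronize()                   # let the copy finish before the next run
    return t
end

# Warm-up runs so compilation is not included in the timings.
host_time(dst, pageable)
host_time(dst, pinned)

t_pageable = minimum(host_time(dst, pageable) for _ in 1:20)
t_pinned   = minimum(host_time(dst, pinned) for _ in 1:20)

println("pageable: ", t_pageable, " s, pinned: ", t_pinned, " s")
```

If the pinned copy is truly non-blocking, its host-side time should be noticeably shorter than the pageable one; if both numbers are similar, the copy is probably being synchronized somewhere, which is what the API call timeline would reveal.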
