Questions about using CUDA.jl for concurrent GPU programming: computational results cannot be obtained when overlapping GPU and CPU operations

Hi,
I’m new to parallel programming in Julia. I want to keep the otherwise idle CPU working while the GPU is doing calculations, so I wrote the following code. What confuses me is that I can’t get the calculation results.

using CUDA
ngpu = 10000
ncpu = 3000

Acpu  =  rand(Float64,ncpu,ncpu)
Bcpu  =  rand(Float64,ncpu)
Ccpu  =  zeros(Float64,ncpu)

Agpu =  CUDA.rand(Float64,ngpu,ngpu)
Bgpu =  CUDA.rand(Float64,ngpu)
Cgpu =  CUDA.zeros(Float64,ngpu)

@sync begin
    @async begin
        Cgpu = Agpu*Bgpu
        synchronize() 
    end
    @async begin 
        Ccpu = Acpu*Bcpu 
    end
    synchronize()
end 

Compared with running the GPU and CPU operations without overlap, the time is indeed shortened. However, when I checked the values of Ccpu and Cgpu, I found that there were no results; they were still zero.

julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 ⋮
 0.0

julia> Ccpu
3000-element Vector{Float64}:
 0.0
 0.0
 0.0
 ⋮
 0.0

After that, I tried writing the GPU and CPU functions myself and found that I could get the calculation results.
The GPU kernel is as follows:

function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    # one thread per row of Agpu
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    num = size(Agpu, 1)
    if it > num
        return
    end

    for i = 1:num
        Cgpu[it] += Agpu[it, i] * Bgpu[i]
    end
    return
end

The CPU function is as follows:

function MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    num = size(Acpu, 1)
    for i = 1:num          # column-major order: iterate columns outermost
        for j = 1:num
            Ccpu[j] += Acpu[j, i] * Bcpu[i]
        end
    end
    return
end

I tried overlapping the GPU and CPU operations as before:

# reset the outputs, since both functions accumulate into them
Cgpu = CUDA.zeros(Float64, ngpu)
Ccpu = zeros(Float64, ncpu)

@sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256, 
            blocks = cld(size(Agpu,1),256), 
            MatrixVectorMul!(Agpu,Bgpu,Cgpu)
        )
    end
    @async begin 
        MatrixVectorMulcpu!(Acpu,Bcpu,Ccpu)
    end
    synchronize()
end

This time I got the result:

julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 2519.9909927774434
 2506.254771662522
 2519.7125246753635
 2524.0248509403823
 2488.2238108879537
 2487.392467114674
 2514.433696898406
 2527.249667217526
 2524.7751937008948
 2490.0621013787577
 2529.0631261691633
 2474.905377092883
 2507.2416820931517
    ⋮
 2507.195103329959

julia> Ccpu
3000-element Vector{Float64}:
 738.0790280921476
 747.322912434719
 753.8496326244853
 772.8891677414435
 752.6666222077597
 743.9383860237455
 755.9752339965773
 742.7187969085796
 755.212427805986
 748.8083971634609
 748.226751429639
 759.2037226616454
 739.5902158523694
   ⋮
 748.9861236734295

  • I want to ask whether I’m missing something in my code, or whether this could be a bug. When overlapping GPU and CPU operations, why do the built-in functions of Julia and CUDA.jl produce no results, while only the functions I wrote myself do?

  • I also want to ask whether the GPU can use its idle threads to run another kernel while it is already executing a kernel. If so, how should the code be written?

I made the following attempt, but I still can’t get the results, and the calculation time is not shortened.

using CUDA

ngpu1 = 10000
ngpu2 = 5000

Agpu1 =  CUDA.rand(Float64,ngpu1,ngpu1)
Bgpu1 =  CUDA.rand(Float64,ngpu1)

Agpu2 =  CUDA.rand(Float64,ngpu2,ngpu2)
Bgpu2 =  CUDA.rand(Float64,ngpu2)

Cgpu1 =  CUDA.zeros(Float64,ngpu1)
Cgpu2 =  CUDA.zeros(Float64,ngpu2)

begin
    @async begin
        Cgpu1 = Agpu1*Bgpu1
        synchronize() 
    end
    @async begin 
        Cgpu2 = Agpu2*Bgpu2
        synchronize()
    end
end

julia> Cgpu1
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 ⋮
 0.0

julia> Cgpu2
5000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 ⋮
 0.0

Julia version: 1.7.2
CUDA.jl version: v4.0.1

Thanks for taking the time to read my questions.

Cgpu = Agpu*Bgpu is not doing what you think it’s doing: inside the @async block, this syntax creates a new variable Cgpu that is local to the task, and that is importantly not the same array you allocated at the top of your script. In other words, you’re not operating in-place on the original Cgpu, but instead binding a freshly allocated result to a name that is only scoped to the @async task, so you also can’t access it from outside the @async. The same applies to Ccpu, and to Cgpu1 and Cgpu2 in your last attempt.
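
The same rebinding behavior can be reproduced with plain CPU arrays, no GPU needed; here is a minimal sketch of = versus .= inside a task:

```julia
C = zeros(3)

@sync @async begin
    C = ones(3)    # `=` rebinds a task-local variable; the outer C is untouched
end
C                  # still [0.0, 0.0, 0.0]

@sync @async begin
    C .= 1.0       # `.=` mutates the existing array in place
end
C                  # now [1.0, 1.0, 1.0]
```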

What you really want is Cgpu .= Agpu * Bgpu, to ensure that the result of Agpu * Bgpu is written directly into Cgpu, instead of allocating a new array.
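
Applied to the first example, the overlapped section could look like the sketch below (assuming the arrays defined at the top of the script). mul! from the standard LinearAlgebra library is equivalent to Cgpu .= Agpu * Bgpu, but it also avoids allocating the temporary product array:

```julia
using CUDA, LinearAlgebra

@sync begin
    @async begin
        mul!(Cgpu, Agpu, Bgpu)   # in-place: the result lands in the existing Cgpu
        synchronize()
    end
    @async begin
        mul!(Ccpu, Acpu, Bcpu)   # same in-place fix on the CPU side
    end
end
```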


Thank you very much for your detailed answer!