Questions about using CUDA.jl for concurrent GPU programming: computational results cannot be obtained when overlapping GPU and CPU operations

Hi,
I’m new to parallel programming in Julia. I want to keep the otherwise idle CPU working while the GPU is doing calculations, so I wrote the following code. What confuses me is that I can’t get the calculation results.

using CUDA
ngpu = 10000
ncpu = 3000

Acpu  =  rand(Float64,ncpu,ncpu)
Bcpu  =  rand(Float64,ncpu)
Ccpu  =  zeros(Float64,ncpu)

Agpu =  CUDA.rand(Float64,ngpu,ngpu)
Bgpu =  CUDA.rand(Float64,ngpu)
Cgpu =  CUDA.zeros(Float64,ngpu)

@sync begin
    @async begin
        Cgpu = Agpu*Bgpu
        synchronize() 
    end
    @async begin 
        Ccpu = Acpu*Bcpu 
    end
    synchronize()
end 

Compared with running the GPU and CPU operations without overlap, the time is indeed shortened. However, when I checked the values of Ccpu and Cgpu, I found that there were no results; they were still zero.

julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 ⋮
 0.0

julia> Ccpu
3000-element Vector{Float64}:
 0.0
 0.0
 0.0
 ⋮
 0.0

After that, I tried writing the GPU and CPU functions myself and found that I could get the calculation results.
The GPU kernel is as follows:

function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    # one thread per row of Agpu
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    num = size(Agpu, 1)
    if it > num
        return
    end

    for i = 1:num
        Cgpu[it] += Agpu[it, i] * Bgpu[i]
    end
    return
end

The CPU function is as follows:

function MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    num = size(Acpu, 1)
    for i = 1:num          # column-major order: iterate columns outermost
        for j = 1:num
            Ccpu[j] += Acpu[j, i] * Bcpu[i]
        end
    end
    return
end

I tried overlapping the GPU and CPU operations as before:

# reset the outputs, since both functions accumulate into them
Cgpu = CUDA.zeros(Float64, ngpu)
Ccpu = zeros(Float64, ncpu)

@sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256, 
            blocks = cld(size(Agpu,1),256), 
            MatrixVectorMul!(Agpu,Bgpu,Cgpu)
        )
    end
    @async begin 
        MatrixVectorMulcpu!(Acpu,Bcpu,Ccpu)
    end
    synchronize()
end

This time I got the result:

julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 2519.9909927774434
 2506.254771662522
 2519.7125246753635
 2524.0248509403823
 2488.2238108879537
 2487.392467114674
 2514.433696898406
 2527.249667217526
 2524.7751937008948
 2490.0621013787577
 2529.0631261691633
 2474.905377092883
 2507.2416820931517
    ⋮
 2507.195103329959

julia> Ccpu
3000-element Vector{Float64}:
 738.0790280921476
 747.322912434719
 753.8496326244853
 772.8891677414435
 752.6666222077597
 743.9383860237455
 755.9752339965773
 742.7187969085796
 755.212427805986
 748.8083971634609
 748.226751429639
 759.2037226616454
 739.5902158523694
   ⋮
 748.9861236734295

  • I want to ask whether I’m missing something in my code, or whether this could be a bug. When overlapping GPU and CPU operations, why do the built-in functions of Julia and CUDA.jl produce no results, while only the functions I wrote myself do?

  • I also want to ask whether the GPU can use its idle threads to run another kernel while it is already executing a kernel. If so, how should the code be written?

I made the following attempt, but I still can’t get the results, and the calculation time is not shortened.

using CUDA

ngpu1 = 10000
ngpu2 = 5000

Agpu1 =  CUDA.rand(Float64,ngpu1,ngpu1)
Bgpu1 =  CUDA.rand(Float64,ngpu1)

Agpu2 =  CUDA.rand(Float64,ngpu2,ngpu2)
Bgpu2 =  CUDA.rand(Float64,ngpu2)

Cgpu1 =  CUDA.zeros(Float64,ngpu1)
Cgpu2 =  CUDA.zeros(Float64,ngpu2)

begin
    @async begin
        Cgpu1 = Agpu1*Bgpu1
        synchronize() 
    end
    @async begin 
        Cgpu2 = Agpu2*Bgpu2
        synchronize()
    end
end

julia> Cgpu1
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 ⋮
 0.0

julia> Cgpu2
5000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 ⋮
 0.0

Julia version: 1.7.2
CUDA.jl version: v4.0.1

Thanks for taking the time to read my questions.

Cgpu = Agpu*Bgpu is not doing what you think it’s doing: inside the @async block, this syntax creates a new variable Cgpu that is local to the task, and that is importantly not the same array you allocated at the top of your script. In other words, you’re not operating in-place on the original Cgpu, but instead binding a freshly allocated result to a name that is only scoped to the @async task, so you also can’t access it from outside the @async. The same applies to Ccpu, and to Cgpu1 and Cgpu2 in your last attempt.
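
The same rebinding behavior can be reproduced with plain CPU arrays, no GPU needed; here is a minimal sketch of = versus .= inside a task:

```julia
C = zeros(3)

@sync @async begin
    C = ones(3)    # `=` rebinds a task-local variable; the outer C is untouched
end
C                  # still [0.0, 0.0, 0.0]

@sync @async begin
    C .= 1.0       # `.=` mutates the existing array in place
end
C                  # now [1.0, 1.0, 1.0]
```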

What you really want is Cgpu .= Agpu * Bgpu, to ensure that the result of Agpu * Bgpu is written directly into Cgpu, instead of allocating a new array.
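
Applied to the first example, the overlapped section could look like the sketch below (assuming the arrays defined at the top of the script). mul! from the standard LinearAlgebra library is equivalent to Cgpu .= Agpu * Bgpu, but it also avoids allocating the temporary product array:

```julia
using CUDA, LinearAlgebra

@sync begin
    @async begin
        mul!(Cgpu, Agpu, Bgpu)   # in-place: the result lands in the existing Cgpu
        synchronize()
    end
    @async begin
        mul!(Ccpu, Acpu, Bcpu)   # same in-place fix on the CPU side
    end
end
```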


Thank you very much for your detailed answer!