Hi,
I'm new to Julia parallel programming. I want to put the idle CPU to work while the GPU is doing calculations, so I wrote the following code. However, what confuses me is that I can't get the calculation result.
using CUDA
ngpu = 10000
ncpu = 3000
Acpu = rand(Float64,ncpu,ncpu)
Bcpu = rand(Float64,ncpu)
Ccpu = zeros(Float64,ncpu)
Agpu = CUDA.rand(Float64,ngpu,ngpu)
Bgpu = CUDA.rand(Float64,ngpu)
Cgpu = CUDA.zeros(Float64,ngpu)
@sync begin
    @async begin
        Cgpu = Agpu * Bgpu
        synchronize()
    end
    @async begin
        Ccpu = Acpu * Bcpu
    end
    synchronize()
end
Compared with running the GPU and CPU operations sequentially, without overlap, the time is indeed shortened. However, when I checked the values of Ccpu and Cgpu, I found there was no result: both were still zero.
julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
⋮
0.0
julia> Ccpu
3000-element Vector{Float64}:
0.0
0.0
0.0
⋮
0.0
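For comparison, a CPU-only version of the same task pattern does produce results when the tasks write into the preallocated arrays with in-place `mul!` instead of rebinding the globals (a minimal sketch, no CUDA involved; the sizes here are just illustrative):

```julia
using LinearAlgebra

n = 100
A1 = rand(Float64, n, n); b1 = rand(Float64, n); c1 = zeros(Float64, n)
A2 = rand(Float64, n, n); b2 = rand(Float64, n); c2 = zeros(Float64, n)

@sync begin
    # mul! writes into the preallocated output array,
    # so the names c1 and c2 are never rebound inside the tasks
    @async mul!(c1, A1, b1)
    @async mul!(c2, A2, b2)
end

@assert c1 ≈ A1 * b1
@assert c2 ≈ A2 * b2
```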
After that, I tried writing the GPU and CPU code myself and found that I could get calculation results.
The GPU function is as follows:
function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    num = size(Agpu, 1)
    if it > num
        return
    end
    for i = 1:num
        Cgpu[it] = Cgpu[it] + Agpu[it, i] * Bgpu[i]
    end
    return
end
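The first line of the kernel is the usual 1-based global thread index. The arithmetic can be sanity-checked on the CPU (the `globalindex` helper below is hypothetical, just to mirror the kernel's expression):

```julia
# mirrors (blockIdx().x - 1) * blockDim().x + threadIdx().x, all 1-based
globalindex(block, blockdim, thread) = (block - 1) * blockdim + thread

@assert globalindex(1, 256, 1) == 1      # first thread of first block
@assert globalindex(1, 256, 256) == 256  # last thread of first block
@assert globalindex(2, 256, 1) == 257    # first thread of second block
```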
The CPU function is as follows:
function MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    num = size(Acpu, 1)
    for i = 1:num
        for j = 1:num
            Ccpu[j] = Ccpu[j] + Acpu[j, i] * Bcpu[i]
        end
    end
end
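This loop order (columns outer, rows inner) walks `Acpu` down its columns, which is the cache-friendly order for Julia's column-major arrays. A quick standalone check against the built-in matvec (the function is repeated here so the snippet runs on its own):

```julia
# repeated so this snippet is self-contained
function MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    num = size(Acpu, 1)
    for i = 1:num          # columns (outer): column-major friendly
        for j = 1:num      # rows (inner)
            Ccpu[j] = Ccpu[j] + Acpu[j, i] * Bcpu[i]
        end
    end
end

A = rand(Float64, 50, 50)
b = rand(Float64, 50)
c = zeros(Float64, 50)
MatrixVectorMulcpu!(A, b, c)
@assert c ≈ A * b   # matches the built-in matrix-vector product
```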
Then I tried overlapping the GPU and CPU operations as before:
ngpu1 = 10000
ngpu2 = 5000
Agpu1 = CUDA.rand(Float64,ngpu1,ngpu1)
Bgpu1 = CUDA.rand(Float64,ngpu1)
Agpu2 = CUDA.rand(Float64,ngpu2,ngpu2)
Bgpu2 = CUDA.rand(Float64,ngpu2)
Cgpu1 = CUDA.zeros(Float64,ngpu1)
Cgpu2 = CUDA.zeros(Float64,ngpu2)
@sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu, 1), 256),
            MatrixVectorMul!(Agpu, Bgpu, Cgpu)
        )
    end
    @async begin
        MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    end
    synchronize()
end
This time I got the result:
julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
2519.9909927774434
2506.254771662522
2519.7125246753635
2524.0248509403823
2488.2238108879537
2487.392467114674
2514.433696898406
2527.249667217526
2524.7751937008948
2490.0621013787577
2529.0631261691633
2474.905377092883
2507.2416820931517
⋮
2507.195103329959
julia> Ccpu
3000-element Vector{Float64}:
738.0790280921476
747.322912434719
753.8496326244853
772.8891677414435
752.6666222077597
743.9383860237455
755.9752339965773
742.7187969085796
755.212427805986
748.8083971634609
748.226751429639
759.2037226616454
739.5902158523694
⋮
748.9861236734295

I want to ask whether I'm missing something in my programming, or whether this could be a bug. When concurrent code overlaps GPU and CPU operations, I cannot obtain results using the built-in functions of Julia or CUDA.jl; only functions I wrote myself give results.

I also want to ask whether the GPU can use idle threads to compute other kernel functions while it is already running one kernel. If so, how should the code be written?
I made the following attempt; I still can't get the result, and the calculation time was not shortened.
using CUDA
ngpu1 = 10000
ngpu2 = 5000
Agpu1 = CUDA.rand(Float64,ngpu1,ngpu1)
Bgpu1 = CUDA.rand(Float64,ngpu1)
Agpu2 = CUDA.rand(Float64,ngpu2,ngpu2)
Bgpu2 = CUDA.rand(Float64,ngpu2)
Cgpu1 = CUDA.zeros(Float64,ngpu1)
Cgpu2 = CUDA.zeros(Float64,ngpu2)
begin
    @async begin
        Cgpu1 = Agpu1 * Bgpu1
        synchronize()
    end
    @async begin
        Cgpu2 = Agpu2 * Bgpu2
        synchronize()
    end
end
julia> Cgpu1
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
⋮
0.0
julia> Cgpu2
5000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
⋮
0.0
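One thing I noticed about this last attempt: the outer block is a plain `begin`, not `@sync begin`, so the `@async` tasks are only scheduled, and nothing waits for them before I inspect the results. A small CPU-only sketch of the same timing issue (no CUDA; `sleep` just stands in for a long computation):

```julia
c = zeros(2)
t1 = @async (sleep(0.1); c[1] = 1.0)
t2 = @async (sleep(0.1); c[2] = 2.0)

# The tasks are scheduled but have not run yet: @async tasks only start
# once the current task hits a yield point, so c is still all zeros here.
@assert c == [0.0, 0.0]

wait(t1); wait(t2)
@assert c == [1.0, 2.0]   # after waiting, both results are in place
```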
Julia version: 1.7.2
CUDA.jl version: v4.0.1
Thanks for taking the time to read my questions.