Hi,
I’m new to parallel programming in Julia. I want to keep the otherwise idle CPU working while the GPU is doing its calculations, so I wrote the following code. What confuses me is that I can’t get the calculation results.
```julia
using CUDA

ngpu = 10000
ncpu = 3000
Acpu = rand(Float64, ncpu, ncpu)
Bcpu = rand(Float64, ncpu)
Ccpu = zeros(Float64, ncpu)
Agpu = CUDA.rand(Float64, ngpu, ngpu)
Bgpu = CUDA.rand(Float64, ngpu)
Cgpu = CUDA.zeros(Float64, ngpu)

@sync begin
    @async begin
        Cgpu = Agpu * Bgpu
        synchronize()
    end
    @async begin
        Ccpu = Acpu * Bcpu
    end
    synchronize()
end
```
Compared with running the GPU and CPU operations without overlap (the version I compared against is sketched after the output below), the time is indeed shortened. However, when I checked the values of Ccpu and Cgpu, I found that there were no results; both were still all zero:
```
julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
⋮
0.0
julia> Ccpu
3000-element Vector{Float64}:
0.0
0.0
0.0
⋮
0.0
```
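For reference, by “without overlap” I mean simply running the two products one after the other and timing that, roughly like this (same arrays as above):

```julia
@time begin
    Cgpu = Agpu * Bgpu   # launch the GPU matvec
    synchronize()        # wait for the GPU to finish
    Ccpu = Acpu * Bcpu   # only then run the CPU matvec
end
```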
After that, I tried writing the GPU and CPU code myself and found that I could get the calculation results.
The GPU function is as follows:
```julia
function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    # global thread index: one thread handles one row of Agpu
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    num = size(Agpu, 1)
    if it > num       # surplus threads do nothing
        return
    end
    # accumulate the dot product of row `it` of Agpu with Bgpu
    for i = 1:num
        Cgpu[it] = Cgpu[it] + Agpu[it, i] * Bgpu[i]
    end
    return
end
```
The CPU function is as follows:
```julia
function MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    num = size(Acpu, 1)
    # outer loop over columns, inner loop over rows,
    # so memory is accessed in column-major order
    for i = 1:num
        for j = 1:num
            Ccpu[j] = Ccpu[j] + Acpu[j, i] * Bcpu[i]
        end
    end
end
```
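(As a quick sanity check, both hand-written functions can be compared against the built-in `*` on small throwaway arrays, something like this:)

```julia
using CUDA

n  = 100
A  = rand(Float64, n, n);  B  = rand(Float64, n);  C  = zeros(Float64, n)
Ad = CuArray(A);           Bd = CuArray(B);        Cd = CUDA.zeros(Float64, n)

MatrixVectorMulcpu!(A, B, C)
CUDA.@sync @cuda threads=256 blocks=cld(n,256) MatrixVectorMul!(Ad, Bd, Cd)

@show C ≈ A * B          # should be true
@show Array(Cd) ≈ A * B  # should be true
```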
Then I tried overlapping the GPU and CPU operations as before:
```julia
@sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu,1), 256),
            MatrixVectorMul!(Agpu, Bgpu, Cgpu)
        )
    end
    @async begin
        MatrixVectorMulcpu!(Acpu, Bcpu, Ccpu)
    end
    synchronize()
end
```
This time I got the result:
```
julia> Cgpu
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
2519.9909927774434
2506.254771662522
2519.7125246753635
2524.0248509403823
2488.2238108879537
2487.392467114674
2514.433696898406
2527.249667217526
2524.7751937008948
2490.0621013787577
2529.0631261691633
2474.905377092883
2507.2416820931517
⋮
2507.195103329959
julia> Ccpu
3000-element Vector{Float64}:
738.0790280921476
747.322912434719
753.8496326244853
772.8891677414435
752.6666222077597
743.9383860237455
755.9752339965773
742.7187969085796
755.212427805986
748.8083971634609
748.226751429639
759.2037226616454
739.5902158523694
⋮
748.9861236734295
```
---
I want to ask whether I’m missing something in my code, or whether this could be a bug. When the GPU and CPU operations are overlapped with concurrent tasks, I cannot get the calculation results from the built-in functions of Julia or CUDA.jl; can results really only be obtained with functions I write myself?
---
I also want to ask whether the GPU can use its idle threads to run another kernel while it is already executing one kernel. If so, how should the code be written?
I made the following attempt, but I still can’t get the results, and the calculation time was not shortened:
```julia
using CUDA

ngpu1 = 10000
ngpu2 = 5000
Agpu1 = CUDA.rand(Float64, ngpu1, ngpu1)
Bgpu1 = CUDA.rand(Float64, ngpu1)
Agpu2 = CUDA.rand(Float64, ngpu2, ngpu2)
Bgpu2 = CUDA.rand(Float64, ngpu2)
Cgpu1 = CUDA.zeros(Float64, ngpu1)
Cgpu2 = CUDA.zeros(Float64, ngpu2)

begin
    @async begin
        Cgpu1 = Agpu1 * Bgpu1
        synchronize()
    end
    @async begin
        Cgpu2 = Agpu2 * Bgpu2
        synchronize()
    end
end
```
```
julia> Cgpu1
10000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
⋮
0.0
julia> Cgpu2
5000-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
⋮
0.0
```
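Is one task per kernel, writing the results in place, the way this is supposed to be done? The sketch below is what I had in mind (I read that each Julia task gets its own CUDA stream in recent CUDA.jl versions, so perhaps `@async` alone is enough, but I’m not sure `mul!` and `CUDA.@sync` are the right calls here):

```julia
using CUDA, LinearAlgebra

# reuses Agpu1/Bgpu1/Cgpu1 and Agpu2/Bgpu2/Cgpu2 from above
@sync begin
    @async begin
        # mul! fills the existing Cgpu1 instead of rebinding the name
        CUDA.@sync mul!(Cgpu1, Agpu1, Bgpu1)
    end
    @async begin
        CUDA.@sync mul!(Cgpu2, Agpu2, Bgpu2)
    end
end
```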
Julia version: 1.7.2
CUDA.jl version: v4.0.1
Thanks for taking the time to read my questions.