Hi,
I'm trying to overlap independent GPU kernels, because a single kernel usually cannot fully utilize the GPU's hardware resources. I defined a kernel function and tried the code below, but the running time did not shrink as I expected; instead it was almost equal to the sum of the two kernels' serial running times. I must be misunderstanding something or making a programming error.
I want to know whether it is possible to run independent GPU kernels in parallel on different streams of the same GPU, overlapping their execution to shorten the overall running time and make fuller use of the GPU.
If so, how should the program be written? I would appreciate any sample programs for reference.
Thanks!
using CUDA
using BenchmarkTools
n = 10000
Agpu1 = CUDA.ones(Float64,n,n)
Bgpu1 = CUDA.ones(Float64,n)
Agpu2 = CUDA.ones(Float64,n,n)
Bgpu2 = CUDA.ones(Float64,n)
Cgpu1 = CUDA.zeros(Float64,n)
Cgpu2 = CUDA.zeros(Float64,n)
# custom kernel function
function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    num = size(Agpu, 1)
    if it > num
        return
    end
    for i = 1:num
        Cgpu[it] = Cgpu[it] + Agpu[it, i] * Bgpu[i]
    end
    return
end
# overlap GPU operations test
@btime @sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu1, 1), 256),
            MatrixVectorMul!(Agpu1, Bgpu1, Cgpu1)
        )
    end
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu2, 1), 256),
            MatrixVectorMul!(Agpu2, Bgpu2, Cgpu2)
        )
    end
end
# Total running time: 7.711 ms
# Single-kernel running time: 3.917 ms
# Inconsistent with the expected result: the total time is not significantly less than twice the single-kernel running time.
# CUDA.jl Version: 4.4.0
# Julia Version: 1.8.5
Your program is written correctly; whether execution will overlap or not is up to the CUDA driver. You'd also better use a proper profiler to verify whether execution actually overlaps (i.e., Nsight Systems).
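As a quick sanity check from within Julia: CUDA.jl gives every Julia task its own stream, so the two @async blocks should at least be queued on different streams. A minimal sketch reusing your kernel and arrays, printing the task-local stream via CUDA.stream() before each launch (whether the kernels then really overlap is still the driver's decision):

using CUDA

@sync begin
    @async begin
        @show CUDA.stream()   # stream used by this task
        CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu1,1),256) MatrixVectorMul!(Agpu1, Bgpu1, Cgpu1)
    end
    @async begin
        @show CUDA.stream()   # should print a different stream
        CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu2,1),256) MatrixVectorMul!(Agpu2, Bgpu2, Cgpu2)
    end
end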
Thanks for your reply. I analyzed the program with Nsight Systems and found that the timelines of the two kernels did not overlap.
Let me ask again: under what circumstances does this programming approach actually give a speedup? Or does the CUDA driver decide randomly, so that the programmer has no control over it?
It's not random, of course, but no, you do not have control over it. You can only express which tasks can execute concurrently; the driver decides (based on resource availability and kernel properties) whether to actually overlap them. Just make sure, in Nsight, that the kernels really use different streams. Furthermore, in more realistic applications you will also perform memory copies from pinned memory inside a task, and those are much easier to overlap (e.g. with a kernel, so that you aren't expecting concurrent execution while contending for the same resources).
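For illustration, a minimal sketch of that copy/compute pattern, assuming CUDA.pin is available in your CUDA.jl version to page-lock a host array (the names h_data/d_data are just placeholders, and actual overlap still needs to be verified in the profiler):

using CUDA

n = 10_000
h_data = rand(Float64, n)
CUDA.pin(h_data)                 # page-lock host memory so the copy can run asynchronously
d_data = CUDA.zeros(Float64, n)

@sync begin
    @async begin
        copyto!(d_data, h_data)  # host-to-device copy on this task's stream
    end
    @async begin
        # kernel launch on another task's stream; copy and compute may overlap
        CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu1,1),256) MatrixVectorMul!(Agpu1, Bgpu1, Cgpu1)
    end
end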
Thanks for your reply. It helped me a lot.
I wrote the following code to try again, but what puzzles me is why the overlapped version takes longer and makes more allocations. My previous understanding was that the running time with overlap should be less than or equal to the time without it.
using CUDA
using BenchmarkTools
a = CUDA.ones(1000)
b = 2*CUDA.ones(1000)
c = CUDA.ones(1000)
d = 3*CUDA.ones(1000)
# overlap
@btime @sync begin
    @async begin
        α = CUDA.dot(a, b)
    end
    @async begin
        β = CUDA.dot(c, d)
    end
end
# 170.200 μs (68 allocations: 4.70 KiB)
# why longer?
# no overlap
@btime @sync begin
    begin
        α = CUDA.dot(a, b)
    end
    begin
        β = CUDA.dot(c, d)
    end
end
# 128.100 μs (15 allocations: 512 bytes)
You should use an actual profiler, as I suggested before. Then you would see that your kernels are so short that overlap isn't possible (they finish before the next kernel is even queued).
Increasing the problem size here doesn't "help" either, because then the GPU is already fully occupied by the dot operation itself. You just shouldn't think of overlapping streams of execution as CPU threads to scale performance linearly with, but as a tool to inform the driver that operations are independent, which may improve performance in some cases.
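If you want to experiment with a case where overlap is at least possible, the kernels need to run long enough to still be executing when the second launch is queued, while leaving resources free. A hedged sketch (illustrative only, using a made-up busy_kernel!; whether the launches really overlap depends on your GPU and driver, so check the timeline in Nsight Systems):

using CUDA

# deliberately long-running kernel that uses only a single block,
# so a second kernel has resources left to run concurrently
function busy_kernel!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        acc = 0.0f0
        for k in 1:100_000
            acc += sin(Float32(k)) * x[i]
        end
        x[i] = acc
    end
    return
end

x1 = CUDA.rand(Float32, 256)   # small arrays -> one block each
x2 = CUDA.rand(Float32, 256)

@sync begin
    @async CUDA.@sync @cuda threads=256 blocks=1 busy_kernel!(x1)
    @async CUDA.@sync @cuda threads=256 blocks=1 busy_kernel!(x2)
end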
Thank you for your thorough response! I understand now.
And FYI, in the next CUDA.jl version you'll be able to use the integrated profiler to figure this out too (albeit not in a nice graphical way):
julia> CUDA.@profile trace=true @sync begin
           @async begin
               α = CUDA.dot(a, b)
           end
           @async begin
               β = CUDA.dot(c, d)
           end
       end
Profiler ran for 104.28 ms, capturing 53 events.
Host-side activity: calling CUDA APIs took 98.65 ms (94.60% of the trace)
┌────┬───────────┬───────────┬───────────────────────┐
│ ID │ Start     │ Time      │ Name                  │
├────┼───────────┼───────────┼───────────────────────┤
│  1 │   3.19 ms │ 953.67 ns │ cuDeviceGet           │
│  2 │   3.19 ms │ 238.42 ns │ cuDeviceGetCount      │
│  4 │    3.2 ms │   6.68 µs │ cuStreamCreate        │
│ 14 │   3.23 ms │  98.54 ms │ cudaLaunchKernel      │
│ 16 │ 101.77 ms │   6.44 µs │ cudaLaunchKernel      │
│ 18 │ 101.78 ms │  44.58 µs │ cudaMemcpyAsync       │
│ 23 │ 101.83 ms │ 953.67 ns │ cudaStreamSynchronize │
│ 24 │ 104.21 ms │ 715.26 ns │ cuDeviceGet           │
│ 25 │ 104.21 ms │ 476.84 ns │ cuDeviceGetCount      │
│ 27 │ 104.21 ms │   4.05 µs │ cuStreamCreate        │
│ 37 │ 104.23 ms │  10.25 µs │ cudaLaunchKernel      │
│ 39 │ 104.24 ms │   5.48 µs │ cudaLaunchKernel      │
│ 41 │ 104.25 ms │  13.11 µs │ cudaMemcpyAsync       │
│ 46 │ 104.26 ms │ 953.67 ns │ cudaStreamSynchronize │
└────┴───────────┴───────────┴───────────────────────┘
Device-side activity: GPU was busy for 7.63 µs (0.01% of the trace)
┌────┬───────────┬───────────┬────────┬─────────┬────────┬──────┬───────────┬─────────┬─────────────┬──────────────────────────────────────────────────────────────────────────────────────────
│ ID │ Start     │ Time      │ Stream │ Threads │ Blocks │ Regs │ SSMem     │ Size    │ Throughput  │ Name                                                                                     ⋯
├────┼───────────┼───────────┼────────┼─────────┼────────┼──────┼───────────┼─────────┼─────────────┼──────────────────────────────────────────────────────────────────────────────────────────
│ 14 │ 101.82 ms │   1.67 µs │     21 │     128 │      8 │   22 │ 512 bytes │       - │           - │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStri ⋯
│ 16 │ 101.82 ms │   1.19 µs │     21 │     128 │      1 │   28 │ 768 bytes │       - │           - │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2 ⋯
│ 18 │ 101.82 ms │   1.43 µs │     21 │       - │      - │    - │         - │ 4 bytes │ 2.667 MiB/s │ [copy device to pageable memory]                                                         ⋯
│ 37 │ 104.24 ms │   1.19 µs │     22 │     128 │      8 │   22 │ 512 bytes │       - │           - │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStri ⋯
│ 39 │ 104.25 ms │   1.19 µs │     22 │     128 │      1 │   28 │ 768 bytes │       - │           - │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2 ⋯
│ 41 │ 104.25 ms │ 953.67 ns │     22 │       - │      - │    - │         - │ 4 bytes │ 4.000 MiB/s │ [copy device to pageable memory]                                                         ⋯
└────┴───────────┴───────────┴────────┴─────────┴────────┴──────┴───────────┴─────────┴─────────────┴──────────────────────────────────────────────────────────────────────────────────────────