How to overlap GPU operations with custom kernel functions?

Hi,

I'm trying to overlap independent GPU kernels, since a single kernel usually cannot fully utilize the GPU's hardware resources. I defined a kernel function and tried the code below, but the running time did not shrink as I expected; it was almost equal to the sum of the serial running times of the two kernels. Presumably I am misunderstanding something or making a programming error.

  • I want to know whether it is possible to run independent GPU kernels in parallel on different streams of the same GPU, overlapping their execution so that the overall running time shrinks and the GPU is utilized more fully.

  • If so, how should the program be written? Some sample programs for reference would be appreciated.

Thanks!

using CUDA
using BenchmarkTools

n = 10000

Agpu1 =  CUDA.ones(Float64,n,n)
Bgpu1 =  CUDA.ones(Float64,n)

Agpu2 =  CUDA.ones(Float64,n,n)
Bgpu2 =  CUDA.ones(Float64,n)

Cgpu1 =  CUDA.zeros(Float64,n)
Cgpu2 =  CUDA.zeros(Float64,n)

#custom kernel: each thread computes one entry of C = A*B (matrix-vector product)
function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # global thread index
    num = size(Agpu, 1)
    if it > num      # surplus threads in the last block do nothing
        return
    end

    # accumulate the dot product of row `it` of A with B
    for i = 1:num
        Cgpu[it] += Agpu[it, i] * Bgpu[i]
    end
    return
end

#overlap GPU operations test

@btime @sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu1,1), 256),
            MatrixVectorMul!(Agpu1, Bgpu1, Cgpu1)
        )
    end
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu2,1), 256),
            MatrixVectorMul!(Agpu2, Bgpu2, Cgpu2)
        )
    end
end

# total running time: 7.711 ms
# single-kernel running time: 3.917 ms
# Inconsistent with the expected result: the total time is not noticeably less than twice the single-kernel running time.

# CUDA.jl Version: 4.4.0
# Julia Version: 1.8.5

Your program is written correctly; whether execution will overlap or not is up to the CUDA driver. You had also better use a proper profiler, e.g. NSight Systems, to verify whether execution actually overlaps.
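
For example, something along these lines (just a sketch: overlap_test.jl is a placeholder name, and I'm assuming that on CUDA.jl 4.x CUDA.@profile starts/stops the external profiler around the wrapped region):

# shell> nsys profile --trace=cuda,nvtx julia overlap_test.jl
#
# inside overlap_test.jl, mark only the region of interest:
CUDA.@profile begin
    @sync begin
        @async CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu1,1),256) MatrixVectorMul!(Agpu1,Bgpu1,Cgpu1)
        @async CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu2,1),256) MatrixVectorMul!(Agpu2,Bgpu2,Cgpu2)
    end
end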

Thanks for your reply. I analyzed the run with NSight Systems and found that the timelines of the two kernels did not overlap.
Let me ask again: under what circumstances does this programming approach actually give a speed-up? Or does the CUDA driver decide this on its own, outside the programmer's control?

It's not random, of course, but no, you do not have control over it. You can only express which tasks can execute concurrently; the driver then decides (based on resource availability and kernel properties) whether to actually overlap them. Just make sure, in NSight, that the kernels are really using different streams. Furthermore, in more realistic applications you will also perform your memory copies from pinned memory in a task, and those are much easier to overlap (e.g., with another kernel, so that you aren't expecting concurrent execution while contending for similar resources).
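
To illustrate that last point, something like the sketch below should let an asynchronous host-to-device copy and an independent kernel proceed on separate streams. It reuses MatrixVectorMul! from above and is untested; CUDA.pin is assumed to be available in your CUDA.jl version (older versions expose pinning under CUDA.Mem).

using CUDA

n = 10_000
hostbuf = ones(Float64, n, n)
CUDA.pin(hostbuf)                   # pin host memory so the copy can run asynchronously

dst = CUDA.zeros(Float64, n, n)     # destination of the upload
A   = CUDA.ones(Float64, n, n)      # inputs for the independent kernel
B   = CUDA.ones(Float64, n)
C   = CUDA.zeros(Float64, n)

@sync begin
    @async begin                    # task 1: copy on its own stream
        copyto!(dst, hostbuf)
        CUDA.synchronize()
    end
    @async begin                    # task 2: kernel on another stream
        @cuda threads=256 blocks=cld(n, 256) MatrixVectorMul!(A, B, C)
        CUDA.synchronize()
    end
end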

Thanks for your reply. It helped me a lot.
I wrote the following code to try again, but what puzzles me is why the overlapped version takes longer and makes more allocations. My previous understanding was that the overlapped running time should be less than or equal to the non-overlapped time.

using CUDA
using BenchmarkTools

a = CUDA.ones(1000)
b = 2*CUDA.ones(1000)
c = CUDA.ones(1000)
d = 3*CUDA.ones(1000)

#overlap
@btime @sync begin
    @async begin
        α = CUDA.dot(a,b)
    end
    @async begin
        β = CUDA.dot(c,d)
    end
end
#  170.200 μs (68 allocations: 4.70 KiB)
# why longer?

#not overlap
@btime @sync begin
    begin
        α = CUDA.dot(a,b)
    end
    begin
        β = CUDA.dot(c,d)
    end
end
# 128.100 ΞΌs (15 allocations: 512 bytes)

You should use an actual profiler, as I suggested before. Then you would see that your kernels are so short that overlap isn't possible: they finish before the next kernel is even queued.
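
For a rough sense of the scale involved, you can also time a single dot product on the device (CUDA.@elapsed measures the wrapped expression with CUDA events), which should confirm how little GPU time it actually needs:

t = CUDA.@elapsed CUDA.dot(a, b)    # seconds, measured on the GPU
println("dot product took ", t * 1e6, " μs")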

Increasing the problem size here doesn't "help" either, because then the GPU is already fully occupied performing a single dot operation. You just shouldn't think of overlapping streams of execution as CPU threads that scale performance linearly, but as a tool to inform the driver that operations are independent, which may improve performance in some cases.
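
In CUDA.jl that independence is expressed by running each operation on its own task, since every task gets its own task-local stream. As a quick sanity check from Julia (complementing the NSight view), you can print the stream each task is using; this is just a sketch:

using CUDA

@sync begin
    @async println("task 1 runs on ", CUDA.stream())
    @async println("task 2 runs on ", CUDA.stream())
end
# the two printed stream handles should differ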

Thank you for your thorough response! I understand now.

And FYI, in the next CUDA.jl version you’ll be able to use the integrated profiler to figure this out too (albeit not in a nice graphical way):

julia> CUDA.@profile trace=true @sync begin
           @async begin
               α = CUDA.dot(a,b)
           end
           @async begin
               β = CUDA.dot(c,d)
           end
       end
Profiler ran for 104.28 ms, capturing 53 events.

Host-side activity: calling CUDA APIs took 98.65 ms (94.60% of the trace)
┌────┬───────────┬───────────┬───────────────────────┐
│ ID │     Start │      Time │ Name                  │
├────┼───────────┼───────────┼───────────────────────┤
│  1 │   3.19 ms │ 953.67 ns │ cuDeviceGet           │
│  2 │   3.19 ms │ 238.42 ns │ cuDeviceGetCount      │
│  4 │    3.2 ms │   6.68 µs │ cuStreamCreate        │
│ 14 │   3.23 ms │  98.54 ms │ cudaLaunchKernel      │
│ 16 │ 101.77 ms │   6.44 µs │ cudaLaunchKernel      │
│ 18 │ 101.78 ms │  44.58 µs │ cudaMemcpyAsync       │
│ 23 │ 101.83 ms │ 953.67 ns │ cudaStreamSynchronize │
│ 24 │ 104.21 ms │ 715.26 ns │ cuDeviceGet           │
│ 25 │ 104.21 ms │ 476.84 ns │ cuDeviceGetCount      │
│ 27 │ 104.21 ms │   4.05 µs │ cuStreamCreate        │
│ 37 │ 104.23 ms │  10.25 µs │ cudaLaunchKernel      │
│ 39 │ 104.24 ms │   5.48 µs │ cudaLaunchKernel      │
│ 41 │ 104.25 ms │  13.11 µs │ cudaMemcpyAsync       │
│ 46 │ 104.26 ms │ 953.67 ns │ cudaStreamSynchronize │
└────┴───────────┴───────────┴───────────────────────┘

Device-side activity: GPU was busy for 7.63 µs (0.01% of the trace)
┌────┬───────────┬───────────┬────────┬─────────┬────────┬──────┬───────────┬─────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────────────────
│ ID │     Start │      Time │ Stream │ Threads │ Blocks │ Regs │     SSMem │    Size │  Throughput │ Name                                                                                     ⋯
├────┼───────────┼───────────┼────────┼─────────┼────────┼──────┼───────────┼─────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────────────────
│ 14 │ 101.82 ms │   1.67 µs │     21 │     128 │      8 │   22 │ 512 bytes │       - │           - │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStri ⋯
│ 16 │ 101.82 ms │   1.19 µs │     21 │     128 │      1 │   28 │ 768 bytes │       - │           - │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2 ⋯
│ 18 │ 101.82 ms │   1.43 µs │     21 │       - │      - │    - │         - │ 4 bytes │ 2.667 MiB/s │ [copy device to pageable memory]                                                         ⋯
│ 37 │ 104.24 ms │   1.19 µs │     22 │     128 │      8 │   22 │ 512 bytes │       - │           - │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStri ⋯
│ 39 │ 104.25 ms │   1.19 µs │     22 │     128 │      1 │   28 │ 768 bytes │       - │           - │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2 ⋯
│ 41 │ 104.25 ms │ 953.67 ns │     22 │       - │      - │    - │         - │ 4 bytes │ 4.000 MiB/s │ [copy device to pageable memory]                                                         ⋯
└────┴───────────┴───────────┴────────┴─────────┴────────┴──────┴───────────┴─────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────────────────

Thanks! I will try.