How to overlap GPU operations with custom kernel functions?

Hi,

I'm trying to overlap independent GPU kernels, since a single kernel usually cannot fully utilize the GPU's hardware resources. I defined a kernel function and tried the code below, but the running time did not shrink as I expected; it was almost equal to the sum of the serial running times of the two kernels. Presumably I am misunderstanding something or making a programming error.

  • I want to know whether it is possible to run independent GPU kernels in parallel on different streams of the same GPU, overlapping their execution so that the overall running time shrinks and the GPU is utilized more fully.

  • If so, how should the program be written? Some sample programs for reference would be appreciated.

Thanks!

using CUDA
using BenchmarkTools

n = 10000

Agpu1 =  CUDA.ones(Float64,n,n)
Bgpu1 =  CUDA.ones(Float64,n)

Agpu2 =  CUDA.ones(Float64,n,n)
Bgpu2 =  CUDA.ones(Float64,n)

Cgpu1 =  CUDA.zeros(Float64,n)
Cgpu2 =  CUDA.zeros(Float64,n)

#custom kernel: each thread computes one entry of C = A*B (matrix-vector product)
function MatrixVectorMul!(Agpu, Bgpu, Cgpu)
    it = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # global thread index
    num = size(Agpu, 1)
    if it > num      # surplus threads in the last block do nothing
        return
    end

    # accumulate the dot product of row `it` of A with B
    for i = 1:num
        Cgpu[it] += Agpu[it, i] * Bgpu[i]
    end
    return
end

#overlap GPU operations test

@btime @sync begin
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu1,1), 256),
            MatrixVectorMul!(Agpu1, Bgpu1, Cgpu1)
        )
    end
    @async begin
        CUDA.@sync @cuda(
            threads = 256,
            blocks = cld(size(Agpu2,1), 256),
            MatrixVectorMul!(Agpu2, Bgpu2, Cgpu2)
        )
    end
end

# total running time: 7.711 ms
# single-kernel running time: 3.917 ms
# Inconsistent with the expected result: the total time is not noticeably less than twice the single-kernel running time.

# CUDA.jl Version: 4.4.0
# Julia Version: 1.8.5

Your program is written correctly; whether execution will overlap or not is up to the CUDA driver. You had also better use a proper profiler, e.g. NSight Systems, to verify whether execution actually overlaps.
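
For example, something along these lines (just a sketch: overlap_test.jl is a placeholder name, and I'm assuming that on CUDA.jl 4.x CUDA.@profile starts/stops the external profiler around the wrapped region):

# shell> nsys profile --trace=cuda,nvtx julia overlap_test.jl
#
# inside overlap_test.jl, mark only the region of interest:
CUDA.@profile begin
    @sync begin
        @async CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu1,1),256) MatrixVectorMul!(Agpu1,Bgpu1,Cgpu1)
        @async CUDA.@sync @cuda threads=256 blocks=cld(size(Agpu2,1),256) MatrixVectorMul!(Agpu2,Bgpu2,Cgpu2)
    end
end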

Thanks for your reply. I analyzed the run with NSight Systems and found that the timelines of the two kernels did not overlap.
Let me ask again: under what circumstances does this programming approach actually give a speed-up? Or does the CUDA driver decide this on its own, outside the programmer's control?

It's not random, of course, but no, you do not have control over it. You can only express which tasks can execute concurrently; the driver then decides (based on resource availability and kernel properties) whether to actually overlap them. Just make sure, in NSight, that the kernels are really using different streams. Furthermore, in more realistic applications you will also perform your memory copies from pinned memory in a task, and those are much easier to overlap (e.g., with another kernel, so that you aren't expecting concurrent execution while contending for similar resources).
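
To illustrate that last point, something like the sketch below should let an asynchronous host-to-device copy and an independent kernel proceed on separate streams. It reuses MatrixVectorMul! from above and is untested; CUDA.pin is assumed to be available in your CUDA.jl version (older versions expose pinning under CUDA.Mem).

using CUDA

n = 10_000
hostbuf = ones(Float64, n, n)
CUDA.pin(hostbuf)                   # pin host memory so the copy can run asynchronously

dst = CUDA.zeros(Float64, n, n)     # destination of the upload
A   = CUDA.ones(Float64, n, n)      # inputs for the independent kernel
B   = CUDA.ones(Float64, n)
C   = CUDA.zeros(Float64, n)

@sync begin
    @async begin                    # task 1: copy on its own stream
        copyto!(dst, hostbuf)
        CUDA.synchronize()
    end
    @async begin                    # task 2: kernel on another stream
        @cuda threads=256 blocks=cld(n, 256) MatrixVectorMul!(A, B, C)
        CUDA.synchronize()
    end
end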

Thanks for your reply. It helped me a lot.
I wrote the following code to try again, but what puzzles me is why the overlapped version takes longer and makes more allocations. My previous understanding was that the overlapped running time should be less than or equal to the non-overlapped time.

using CUDA
using BenchmarkTools

a = CUDA.ones(1000)
b = 2*CUDA.ones(1000)
c = CUDA.ones(1000)
d = 3*CUDA.ones(1000)

#overlap
@btime @sync begin
    @async begin
        α = CUDA.dot(a,b)
    end
    @async begin
        β = CUDA.dot(c,d)
    end
end
#  170.200 μs (68 allocations: 4.70 KiB)
# why longer?

#not overlap
@btime @sync begin
    begin
        α = CUDA.dot(a,b)
    end
    begin
        β = CUDA.dot(c,d)
    end
end
# 128.100 ΞΌs (15 allocations: 512 bytes)

You should use an actual profiler, as I suggested before. Then you would see that your kernels are so short that overlap isn't possible: they finish before the next kernel is even queued.
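
For a rough sense of the scale involved, you can also time a single dot product on the device (CUDA.@elapsed measures the wrapped expression with CUDA events), which should confirm how little GPU time it actually needs:

t = CUDA.@elapsed CUDA.dot(a, b)    # seconds, measured on the GPU
println("dot product took ", t * 1e6, " μs")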

Increasing the problem size here doesn't "help" either, because then the GPU is already fully occupied performing a single dot operation. You just shouldn't think of overlapping streams of execution as CPU threads that scale performance linearly, but as a tool to inform the driver that operations are independent, which may improve performance in some cases.
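
In CUDA.jl that independence is expressed by running each operation on its own task, since every task gets its own task-local stream. As a quick sanity check from Julia (complementing the NSight view), you can print the stream each task is using; this is just a sketch:

using CUDA

@sync begin
    @async println("task 1 runs on ", CUDA.stream())
    @async println("task 2 runs on ", CUDA.stream())
end
# the two printed stream handles should differ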

Thank you for your thorough response! I understand now.

And FYI, in the next CUDA.jl version you’ll be able to use the integrated profiler to figure this out too (albeit not in a nice graphical way):

julia> CUDA.@profile trace=true @sync begin
           @async begin
               α = CUDA.dot(a,b)
           end
           @async begin
               β = CUDA.dot(c,d)
           end
       end
Profiler ran for 104.28 ms, capturing 53 events.

Host-side activity: calling CUDA APIs took 98.65 ms (94.60% of the trace)
┌────┬───────────┬───────────┬───────────────────────┐
│ ID │     Start │      Time │ Name                  │
├────┼───────────┼───────────┼───────────────────────┤
│  1 │   3.19 ms │ 953.67 ns │ cuDeviceGet           │
│  2 │   3.19 ms │ 238.42 ns │ cuDeviceGetCount      │
│  4 │    3.2 ms │   6.68 µs │ cuStreamCreate        │
│ 14 │   3.23 ms │  98.54 ms │ cudaLaunchKernel      │
│ 16 │ 101.77 ms │   6.44 µs │ cudaLaunchKernel      │
│ 18 │ 101.78 ms │  44.58 µs │ cudaMemcpyAsync       │
│ 23 │ 101.83 ms │ 953.67 ns │ cudaStreamSynchronize │
│ 24 │ 104.21 ms │ 715.26 ns │ cuDeviceGet           │
│ 25 │ 104.21 ms │ 476.84 ns │ cuDeviceGetCount      │
│ 27 │ 104.21 ms │   4.05 µs │ cuStreamCreate        │
│ 37 │ 104.23 ms │  10.25 µs │ cudaLaunchKernel      │
│ 39 │ 104.24 ms │   5.48 µs │ cudaLaunchKernel      │
│ 41 │ 104.25 ms │  13.11 µs │ cudaMemcpyAsync       │
│ 46 │ 104.26 ms │ 953.67 ns │ cudaStreamSynchronize │
└────┴───────────┴───────────┴───────────────────────┘

Device-side activity: GPU was busy for 7.63 µs (0.01% of the trace)
┌────┬───────────┬───────────┬────────┬─────────┬────────┬──────┬───────────┬─────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────────────────
│ ID │     Start │      Time │ Stream │ Threads │ Blocks │ Regs │     SSMem │    Size │  Throughput │ Name                                                                                     ⋯
├────┼───────────┼───────────┼────────┼─────────┼────────┼──────┼───────────┼─────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────────────────
│ 14 │ 101.82 ms │   1.67 µs │     21 │     128 │      8 │   22 │ 512 bytes │       - │           - │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStri ⋯
│ 16 │ 101.82 ms │   1.19 µs │     21 │     128 │      1 │   28 │ 768 bytes │       - │           - │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2 ⋯
│ 18 │ 101.82 ms │   1.43 µs │     21 │       - │      - │    - │         - │ 4 bytes │ 2.667 MiB/s │ [copy device to pageable memory]                                                         ⋯
│ 37 │ 104.24 ms │   1.19 µs │     22 │     128 │      8 │   22 │ 512 bytes │       - │           - │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStri ⋯
│ 39 │ 104.25 ms │   1.19 µs │     22 │     128 │      1 │   28 │ 768 bytes │       - │           - │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2 ⋯
│ 41 │ 104.25 ms │ 953.67 ns │     22 │       - │      - │    - │         - │ 4 bytes │ 4.000 MiB/s │ [copy device to pageable memory]                                                         ⋯
└────┴───────────┴───────────┴────────┴─────────┴────────┴──────┴───────────┴─────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────────────────

Thanks! I will try.