Dynamic parallelism slow in CUDA.jl

z-wang · July 25, 2024, 10:39am

Hi, I'm learning dynamic parallelism in CUDA.jl. Here is some code that launches a parent kernel with 1000 threads, where each parent thread queues the child kernel N times. My problem is that this code is pretty slow, and the running time appears to increase exponentially with N, even though the child kernel literally does nothing.

Could someone help me see what I'm doing wrong here?

using CUDA

function example_parent(N)
    for i in 1:N
        @cuda threads = 100 dynamic = true example_child()
    end
    return nothing
end

function example_child()
    return nothing
end

function test(N)
    CUDA.@sync begin
        @cuda threads = 1000 example_parent(N)
    end
end

test(2) # run once to compile
CUDA.@time test(2) # 0.052070 seconds (9 CPU allocations: 416 bytes)
CUDA.@time test(3) # 6.671022 seconds (397 CPU allocations: 25.188 KiB)

maleadt · July 25, 2024, 6:09pm

By default, space is reserved for 2048 pending child grids; this can be extended by setting the appropriate device limit, as in the following code.
…
The runtime first tries to add the newly launched grid to the fixed-size pool, and if it is full, uses the virtualized pool. While this means that grids are queued successfully, the costs of using the virtualized pool are higher than those of the fixed-size pool.
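To see why the timing might jump between N = 2 and N = 3, it may help to count the launches. The sketch below is plain Julia arithmetic (no GPU needed) and assumes the worst case, where every child launch from the 1000 parent threads is pending at the same time. The helper names `pending_launches` and `fits_fixed_pool` are made up for illustration and are not part of CUDA.jl.

```julia
# Hypothetical helpers for illustration only; not part of CUDA.jl.

# Worst case: every parent thread has queued all N of its child grids
# before any of them has completed.
pending_launches(parent_threads, N) = parent_threads * N

# The default fixed-size pool holds 2048 pending child grids (figure quoted above).
fits_fixed_pool(parent_threads, N; pool = 2048) =
    pending_launches(parent_threads, N) <= pool

for N in 1:4
    total = pending_launches(1000, N)
    println("N = $N: up to $total pending child launches, ",
            "fits in the fixed-size pool: ", fits_fixed_pool(1000, N))
end
```

Under that assumption, N = 2 gives at most 2000 pending launches, which stays below 2048, while N = 3 gives up to 3000, which would overflow into the slower virtualized pool. That is consistent with the timings in the original post, although the sketch does not measure anything itself.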
