Dynamic parallelism slow in CUDA.jl

Hi, I’m learning dynamic parallelism in CUDA.jl. Here I have some code that launches a parent kernel with 1000 threads, and each parent thread launches the child kernel N times. My problem is that this code is pretty slow, and the running time increases dramatically with N, even though the child kernel literally does nothing.

Could someone help me see what I am doing wrong here?

```julia
using CUDA

function example_parent(N)
    for i in 1:N
        @cuda threads = 100 dynamic = true example_child()
    end
    return nothing
end

function example_child()
    return nothing
end

function test(N)
    CUDA.@sync begin
        @cuda threads = 1000 example_parent(N)
    end
end

test(2) # run once to compile
CUDA.@time test(2) # 0.052070 seconds (9 CPU allocations: 416 bytes)
CUDA.@time test(3) # 6.671022 seconds (397 CPU allocations: 25.188 KiB)
```

By default, space is reserved for 2048 pending child grids; this can be extended by setting the appropriate device limit, as in the following code.
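The snippet that sentence refers to is not shown here. As a rough sketch of what raising the limit could look like in CUDA.jl, assuming your version exposes `CUDA.limit`, `CUDA.limit!` and the `CUDA.LIMIT_DEV_RUNTIME_PENDING_LAUNCH_COUNT` enum value as wrappers around the driver's context limits (worth checking against your installed version), something like the following might work. The helper name `reserve_pending_launches!` is made up for illustration.

```julia
using CUDA

# Hypothetical helper (not part of CUDA.jl): grow the fixed-size pool of
# pending child grids so that `n` launches fit without spilling over.
# Assumes CUDA.limit / CUDA.limit! and the enum value below exist in your
# CUDA.jl version.
function reserve_pending_launches!(n::Integer)
    lim = CUDA.LIMIT_DEV_RUNTIME_PENDING_LAUNCH_COUNT
    if CUDA.limit(lim) < n
        CUDA.limit!(lim, n)
    end
    return CUDA.limit(lim)
end

# 1000 parent threads, each launching the child kernel N = 3 times.
reserve_pending_launches!(1000 * 3)
```

The limit has to be set from the host before the parent kernel is launched. A larger pool presumably reserves more device memory, so it is probably best to size it to what the kernel actually needs rather than picking a huge number.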

The runtime first tries to add the newly launched grid to the fixed-size pool, and if it is full, uses the virtualized pool. While this means that grids are queued successfully, the costs of using the virtualized pool are higher than those of the fixed-size pool.
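To connect this to the timings in the question, here is a small back-of-the-envelope check in plain Julia (no GPU needed). It assumes the worst case where every child launch from all 1000 parent threads is pending at the same time; `pending_children` and `DEFAULT_PENDING_LAUNCHES` are names invented for this sketch, and 2048 is the default quoted above.

```julia
# Default size of the fixed pool for pending child grids, per the text above.
const DEFAULT_PENDING_LAUNCHES = 2048

# Upper bound on queued child launches: each parent thread launches N children.
pending_children(parent_threads, N) = parent_threads * N

for N in 1:4
    total = pending_children(1000, N)
    fits = total <= DEFAULT_PENDING_LAUNCHES
    println("N = $N: up to $total child launches, fits in fixed pool: $fits")
end
```

With 1000 parent threads, N = 2 gives at most 2000 pending launches, which stays under 2048, while N = 3 gives up to 3000, which does not. That would be consistent with the jump from about 0.05 s to several seconds between `test(2)` and `test(3)`.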
