Hi, I'm learning dynamic parallelism in CUDA.jl. Here I have some code that launches a parent kernel with 1000 threads, and each parent thread queues the child kernel N times. My problem is that this code is very slow, and the running time grows steeply with N (about 0.05 s for N = 2 versus about 6.7 s for N = 3), even though the child kernel literally does nothing.
Could someone help me see what I'm doing wrong here?
using CUDA

function example_parent(N)
    for i in 1:N
        @cuda threads = 100 dynamic = true example_child()
    end
    return nothing
end

function example_child()
    return nothing
end

function test(N)
    CUDA.@sync begin
        @cuda threads = 1000 example_parent(N)
    end
end

test(2) # run once to compile
CUDA.@time test(2) # 0.052070 seconds (9 CPU allocations: 416 bytes)
CUDA.@time test(3) # 6.671022 seconds (397 CPU allocations: 25.188 KiB)