Clarifying expected behavior of dynamic CUDA kernels

Dynamic Parallelism

I’ve just learned about Dynamic Parallelism, which is

…useful for recursive algorithms, or for algorithms that otherwise need to dynamically spawn new work.

So consider the following example:

using CUDA

threadCount = 10

function innerKernel(outerThread)
    innerThread = threadIdx().x
    @cuprintln("outer: $outerThread, inner: $innerThread")
    return
end

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic()

which produces

outer: 1, inner: 1
outer: 2, inner: 1
outer: 3, inner: 1
outer: 4, inner: 1
outer: 5, inner: 1
outer: 6, inner: 1
outer: 7, inner: 1
outer: 8, inner: 1
outer: 9, inner: 1
outer: 10, inner: 1

Confusion

But I’m confused about what the dynamic call is actually doing. It appears to invoke every inner call on thread 1, but my expectation was that each inner call would be executed on a different thread, e.g.

outer: 2, inner: 9
outer: 4, inner: 7

The example in the documentation doesn’t depend on which thread it’s running on, so it’s rather unhelpful in this case.

Full Code

using CUDA

threadCount = 10

function innerKernel(outerThread)
    innerThread = threadIdx().x
    @cuprintln("outer: $outerThread, inner: $innerThread")
    return
end

function outerKernelStatic()
    outerThread = threadIdx().x
    innerKernel(outerThread)
    return
end

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true innerKernel(outerThread)
    return
end

println("Executing static kernel")
CUDA.@sync @cuda threads=threadCount outerKernelStatic()

println("Executing dynamic kernel")
CUDA.@sync @cuda threads=threadCount outerKernelDynamic()

That’s the wrong understanding. Every dynamic kernel launch is just that, another kernel launch, so the inner kernel starts with a fresh grid where threads are numbered from 1 again.


I guess that…makes perfect sense. But then, if outerKernelDynamic is changed to launch on multiple threads as usual,

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true threads=threadCount innerKernel(outerThread)
    return
end

there’s a big wall of an error:

ERROR: LoadError: GPUCompiler.InvalidIRError(GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(MethodInstance for outerKernelDynamic(), GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(GPUCompiler.PTXCompilerTarget(v"8.6.0", v"7.8.0", true, nothing, nothing, nothing, nothing, false, nothing, nothing), CUDA.CUDACompilerParams(v"8.6.0", v"8.5.0"), true, nothing, :specfunc, false, 2), 0x0000000000006897), Tuple{String, Vector{Base.StackTraces.StackFrame}, Any}[("call to an unknown function", [macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "jl_f_tuple"), ("call to an unknown function", [NamedTuple at boot.jl:727, macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "jl_f_apply_type"), ("call to an unknown function", [NamedTuple at boot.jl:727, macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "ijl_new_structv"), ("dynamic function invocation", [macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], Core.kwcall)])

threadCount is not defined in your kernel, as should be shown by the error (Reason: unsupported use of an undefined name (use of 'threadCount')). You could either forward it as an argument, or look up the block size on the device.
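A minimal sketch of both suggested fixes (untested here; it assumes the innerKernel and threadCount definitions from the question, and outerKernelDynamic2 is just a hypothetical name to keep the two variants apart):

```julia
using CUDA

# Option 1: forward the launch size as a kernel argument,
# so the device code never references the host global.
function outerKernelDynamic(n)
    outerThread = threadIdx().x
    @cuda dynamic=true threads=n innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic(threadCount)

# Option 2: look up the block size on the device instead of
# capturing anything from the host.
function outerKernelDynamic2()
    outerThread = threadIdx().x
    @cuda dynamic=true threads=blockDim().x innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic2()
```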


On further thought, this makes perfect sense (I think?). Basically, it’s because there is no variable threadCount on the device, as host globals aren’t automatically inferred or copied from host memory to device memory?

I find it interesting that while there’s a distinction between host and device arrays (Array is called from and stored on the host, CuArray is called from the host but stored on the device, and CuDeviceArray is called from and stored on the device, as far as I understand), there’s seemingly no such distinction for scalars: say, a Float32 that’s called from and stored on the host, with a corresponding CuFloat32 and CuDeviceFloat32 to communicate that these are stored on, and otherwise accessible to functions on, the device.

All to say, making the process explicit with something like

threadCount       = Float32(10.)    # accessible on the host
threadCountDevice = cu(threadCount) # CuFloat32 accessible on the device

is at least an interesting thought to me for the sake of consistency. I’m sure there’s a good reason; I just find the inconsistency interesting.
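As I understand it (a hedged sketch, not an authoritative account of CUDA.jl internals): the likely reason no CuFloat32 exists is that plain isbits scalars are passed to kernels by value, copied into the kernel’s argument space at launch, so an explicit host-to-device conversion step never arises for them. The scalarKernel name below is hypothetical:

```julia
using CUDA

# A plain isbits scalar needs no device counterpart: when passed as a
# kernel argument, it is copied by value into the kernel's parameters.
function scalarKernel(x)
    @cuprintln("x = $x")   # x is an ordinary Float32 on the device
    return
end

threadCount = Float32(10.0)
@cuda scalarKernel(threadCount)   # no cu(...) conversion required
```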