Clarifying expected behavior of dynamic CUDA kernels

Dynamic Parallelism

I’ve just learned about Dynamic Parallelism, which is

…useful for recursive algorithms, or for algorithms that otherwise need to dynamically spawn new work.

So consider the following example:

using CUDA

threadCount = 10

function innerKernel(outerThread)
    innerThread = threadIdx().x
    @cuprintln("outer: $outerThread, inner: $innerThread")
    return
end

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic()

which produces

outer: 1, inner: 1
outer: 2, inner: 1
outer: 3, inner: 1
outer: 4, inner: 1
outer: 5, inner: 1
outer: 6, inner: 1
outer: 7, inner: 1
outer: 8, inner: 1
outer: 9, inner: 1
outer: 10, inner: 1

Confusion

But I’m confused about what the dynamic call is actually doing. It appears to invoke every inner call on thread 1, but my expectation was that each inner call would be executed on a different thread, e.g.

outer: 2, inner: 9
outer: 4, inner: 7

The example in the documentation doesn’t depend on which thread it’s running on, so it’s rather unhelpful in this case.

Full Code

using CUDA

threadCount = 10

function innerKernel(outerThread)
    innerThread = threadIdx().x
    @cuprintln("outer: $outerThread, inner: $innerThread")
    return
end

function outerKernelStatic()
    outerThread = threadIdx().x
    innerKernel(outerThread)
    return
end

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true innerKernel(outerThread)
    return
end

println("Executing static kernel")
CUDA.@sync @cuda threads=threadCount outerKernelStatic()

println("Executing dynamic kernel")
CUDA.@sync @cuda threads=threadCount outerKernelDynamic()

That’s the wrong understanding. Every dynamic kernel launch is just that, another kernel launch, so the inner kernel starts with a fresh grid where threads are numbered from 1 again.


I guess that…makes perfect sense. But then, if outerKernelDynamic is changed to launch on multiple threads as usual,

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true threads=threadCount innerKernel(outerThread)
    return
end

there’s a big wall of an error:

ERROR: LoadError: GPUCompiler.InvalidIRError(GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(MethodInstance for outerKernelDynamic(), GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(GPUCompiler.PTXCompilerTarget(v"8.6.0", v"7.8.0", true, nothing, nothing, nothing, nothing, false, nothing, nothing), CUDA.CUDACompilerParams(v"8.6.0", v"8.5.0"), true, nothing, :specfunc, false, 2), 0x0000000000006897), Tuple{String, Vector{Base.StackTraces.StackFrame}, Any}[("call to an unknown function", [macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "jl_f_tuple"), ("call to an unknown function", [NamedTuple at boot.jl:727, macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "jl_f_apply_type"), ("call to an unknown function", [NamedTuple at boot.jl:727, macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "ijl_new_structv"), ("dynamic function invocation", [macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], Core.kwcall)])

threadCount is not defined in your kernel, as should be shown by the error (Reason: unsupported use of an undefined name (use of 'threadCount')). You could either forward it as an argument, or look up the block size on the device.
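A minimal sketch of both suggested fixes (untested here; it assumes the innerKernel and threadCount definitions from the question, and outerKernelDynamic2 is just a hypothetical name to keep the two variants apart):

```julia
using CUDA

# Option 1: forward the launch size as a kernel argument,
# so the device code never references the host global.
function outerKernelDynamic(n)
    outerThread = threadIdx().x
    @cuda dynamic=true threads=n innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic(threadCount)

# Option 2: look up the block size on the device instead of
# capturing anything from the host.
function outerKernelDynamic2()
    outerThread = threadIdx().x
    @cuda dynamic=true threads=blockDim().x innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic2()
```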


On further thought, this makes perfect sense (I think?). Basically, it’s because there is no variable threadCount on the device, as host globals aren’t automatically inferred or copied from host memory to device memory?

I find it interesting that while there’s a distinction between host and device arrays (Array is called from and stored on the host, CuArray is called from the host but stored on the device, and CuDeviceArray is called from and stored on the device, as far as I understand), there’s seemingly no such distinction for scalars: say, a Float32 that’s called from and stored on the host, with a corresponding CuFloat32 and CuDeviceFloat32 to communicate that these are stored on, and otherwise accessible to functions on, the device.

All to say, making the process explicit with something like

threadCount       = Float32(10.)    # accessible on the host
threadCountDevice = cu(threadCount) # CuFloat32 accessible on the device

is at least an interesting thought to me for the sake of consistency. I’m sure there’s a good reason; I just find the inconsistency interesting.
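As I understand it (a hedged sketch, not an authoritative account of CUDA.jl internals): the likely reason no CuFloat32 exists is that plain isbits scalars are passed to kernels by value, copied into the kernel’s argument space at launch, so an explicit host-to-device conversion step never arises for them. The scalarKernel name below is hypothetical:

```julia
using CUDA

# A plain isbits scalar needs no device counterpart: when passed as a
# kernel argument, it is copied by value into the kernel's parameters.
function scalarKernel(x)
    @cuprintln("x = $x")   # x is an ordinary Float32 on the device
    return
end

threadCount = Float32(10.0)
@cuda scalarKernel(threadCount)   # no cu(...) conversion required
```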