…useful for recursive algorithms, or for algorithms that otherwise need to dynamically spawn new work.
So consider the following example:
using CUDA

threadCount = 10

function innerKernel(outerThread)
    innerThread = threadIdx().x
    @cuprintln("outer: $outerThread, inner: $innerThread")
    return
end

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true innerKernel(outerThread)
    return
end

@cuda threads=threadCount outerKernelDynamic()
But I’m confused about what the dynamic call is actually doing. It appears to run every inner call on thread 1, whereas I expected each inner call to be executed on a different thread, e.g.
outer: 2, inner: 9
outer: 4, inner: 7
…
The example in the documentation doesn’t depend on which thread it’s running on, so it is rather unhelpful in this case.
Full Code
using CUDA

threadCount = 10

function innerKernel(outerThread)
    innerThread = threadIdx().x
    @cuprintln("outer: $outerThread, inner: $innerThread")
    return
end

function outerKernelStatic()
    outerThread = threadIdx().x
    innerKernel(outerThread)
    return
end

function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true innerKernel(outerThread)
    return
end

println("Executing static kernel")
CUDA.@sync @cuda threads=threadCount outerKernelStatic()

println("Executing dynamic kernel")
CUDA.@sync @cuda threads=threadCount outerKernelDynamic()
That’s the wrong understanding. Every dynamic kernel launch is just that, another kernel launch, so the inner kernel starts with a fresh grid where threads are numbered from 1 again.
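To make the "fresh grid" point concrete, here is a minimal sketch (not from the thread, and assuming a GPU that supports dynamic parallelism). The names `child!`, `parent!` and `out` are made up for illustration, and the child writes into an array instead of printing so the result can be inspected on the host. Each parent thread launches its own child grid of 3 threads, and inside every child grid `threadIdx().x` starts from 1 again.

```julia
using CUDA

# Child kernel: records its own thread index in the row that belongs
# to the parent thread which launched it.
function child!(out, outer)
    inner = threadIdx().x              # numbered from 1 in every child grid
    @inbounds out[outer, inner] = inner
    return
end

# Parent kernel: every parent thread performs a separate kernel launch.
function parent!(out)
    outer = threadIdx().x
    @cuda dynamic=true threads=3 child!(out, outer)
    return
end

out = CUDA.zeros(Int32, 4, 3)          # one row per parent thread
CUDA.@sync @cuda threads=4 parent!(out)
result = Array(out)                    # copy back to the host for inspection
```

If this runs as intended, every row of `result` holds the child indices of one parent thread, independent of which parent thread did the launching.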
I guess that…makes perfect sense. But if outerKernelDynamic is then changed to launch the inner kernel on multiple threads, as in
function outerKernelDynamic()
    outerThread = threadIdx().x
    @cuda dynamic=true threads=threadCount innerKernel(outerThread)
    return
end
I get a big wall of an error:
ERROR: LoadError: GPUCompiler.InvalidIRError(GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(MethodInstance for outerKernelDynamic(), GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(GPUCompiler.PTXCompilerTarget(v"8.6.0", v"7.8.0", true, nothing, nothing, nothing, nothing, false, nothing, nothing), CUDA.CUDACompilerParams(v"8.6.0", v"8.5.0"), true, nothing, :specfunc, false, 2), 0x0000000000006897), Tuple{String, Vector{Base.StackTraces.StackFrame}, Any}[("call to an unknown function", [macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "jl_f_tuple"), ("call to an unknown function", [NamedTuple at boot.jl:727, macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "jl_f_apply_type"), ("call to an unknown function", [NamedTuple at boot.jl:727, macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], "ijl_new_structv"), ("dynamic function invocation", [macro expansion at execution.jl:96, outerKernelDynamic at dynamic_kernel_test.jl:19], Core.kwcall)])
threadCount is not defined in your kernel, as should be shown by the error (Reason: unsupported use of an undefined name (use of 'threadCount')). You could either forward that as an arg, or look up the block size.
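Both suggestions could look roughly like the sketch below. This is an illustration under assumptions rather than tested code from the thread: the `!`-suffixed kernel names and the `sizes` arrays are hypothetical, and the inner kernel records its block size in an array instead of printing so that the launch configuration can be checked on the host.

```julia
using CUDA

# Inner kernel: the first thread of each child grid records how many
# threads that grid was launched with.
function innerKernel!(sizes, outerThread)
    if threadIdx().x == 1
        @inbounds sizes[outerThread] = blockDim().x
    end
    return
end

# Option 1: forward the thread count as a kernel argument.
function outerKernelArg!(sizes, threadCount)
    outerThread = threadIdx().x
    @cuda dynamic=true threads=threadCount innerKernel!(sizes, outerThread)
    return
end

# Option 2: look up the parent's block size on the device.
function outerKernelBlockDim!(sizes)
    outerThread = threadIdx().x
    @cuda dynamic=true threads=blockDim().x innerKernel!(sizes, outerThread)
    return
end

threadCount = 10
sizesA = CUDA.zeros(Int32, threadCount)
sizesB = CUDA.zeros(Int32, threadCount)
CUDA.@sync @cuda threads=threadCount outerKernelArg!(sizesA, threadCount)
CUDA.@sync @cuda threads=threadCount outerKernelBlockDim!(sizesB)
```

In both variants the device code only refers to values it actually has: a kernel argument in the first case, and the launch configuration of the running grid in the second.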
On further thought, this makes perfect sense (I think?). Basically, it’s because there is no variable threadCount on the device as it’s not automatically inferred or copied from the host process/shared memory to the device process/shared memory?
I find it interesting that arrays have an explicit host/device distinction (Array is created and stored on the host, CuArray is created from the host but stored on the device, and CuDeviceArray is used and stored on the device, as far as I understand), yet there is seemingly no such distinction for scalars: there is Float32 on the host, but no CuFloat32 or CuDevice32 to communicate that a value is stored on, and accessible to functions on, the device.
All to say, making the process explicit with something like
threadCount = Float32(10.) # accessible on the host
threadCountDevice = cu(threadCount) # CuFloat32 accessible on the device
is at least an interesting thought to me for the sake of consistency. I’m sure there’s a good reason, I just find the inconsistency interesting.
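For comparison, here is a small sketch of how scalars usually reach the device today. Plain bits values such as a Float32 are passed by value as kernel arguments, so no device-side wrapper type is needed, while the array still has to live in device memory. The kernel name `scale!` and the variables `out` and `factor` are made up for this illustration.

```julia
using CUDA

# The scalar `factor` arrives by value; only `out` refers to device memory.
function scale!(out, factor)
    i = threadIdx().x
    @inbounds out[i] = factor * i
    return
end

out = CUDA.zeros(Float32, 4)   # device array
factor = 2.0f0                 # ordinary host scalar
CUDA.@sync @cuda threads=4 scale!(out, factor)
scaled = Array(out)            # copy the result back to the host
```

The difference from the failing example is that `factor` is an argument of the kernel, whereas `threadCount` there was a global variable referenced from inside device code.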