Modifying a thread-local vector within CUDA Dynamic Parallelism

No, that is sadly not possible. MArrays on the GPU currently depend on the compiler being able to inline every function that uses the MArray, so that it can then turn the GC allocation into a stack-allocated value as an optimization.

Since a dynamic parallelism launch is explicitly a non-inlined function call, this optimization cannot occur.
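For contrast, here is a minimal sketch of the case that does work: the MArray is created and consumed inside a single kernel, so everything can be inlined and the allocation can be elided. The kernel name `scratch_kernel`, the buffer size, and the launch configuration are made up for illustration, and it assumes a working CUDA.jl + StaticArrays setup.

```julia
using CUDA, StaticArrays

# Hypothetical kernel: each thread fills a small thread-local MVector and
# writes the sum of its entries to global memory.
function scratch_kernel(out)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(out)
        # Thread-local scratch buffer. This only compiles for the GPU because
        # every use of `v` below is inlined, so the allocation never escapes.
        v = MVector{4,Float32}(undef)
        for k in 1:4
            @inbounds v[k] = Float32(i * k)
        end
        # Sum manually to keep the example free of further function calls.
        acc = 0.0f0
        for k in 1:4
            @inbounds acc += v[k]
        end
        @inbounds out[i] = acc
    end
    return nothing
end

out = CUDA.zeros(Float32, 32)
@cuda threads=32 scratch_kernel(out)
result = Array(out)   # copy back to the host; result[i] is the sum of i*k for k = 1:4
```

Handing `v` to a child kernel launched from inside `scratch_kernel` is exactly the step that breaks this, because the MArray would then have to escape the function that created it.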

Additionally, I don’t even know if CUDA C supports this. I think you can use dynamic parallelism to launch sub-kernels with different launch configurations, and it is not clear to me which parent thread's thread-local memory address would be passed to which thread in the sub-kernel.
