It would be nice to have “thread local storage” implemented at some point. It looks like it might be necessary for recursive kernel calls. I’m not really sure as I’m still rather new to CUDA programming.
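PTX’s local memory is backed by global (device) memory, so it is slow compared to registers. Why not use StaticArrays for fast thread-local memory? Small, fixed-size static arrays can usually be kept in registers by the compiler. Is there a particular reason you need the former?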
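To make the suggestion concrete, here is a minimal sketch of per-thread scratch data held in an `SVector`. It assumes CUDA.jl and StaticArrays.jl are installed. The names `thread_work` and `kernel!` are made up for illustration, and whether the vector actually stays in registers depends on the compiler and on how it is indexed.

```julia
using CUDA, StaticArrays

# Hypothetical per-thread work. `v` is a fixed-size, immutable value,
# so it needs no heap allocation inside the kernel.
@inline function thread_work(x::Float32)
    v = SVector{4,Float32}(x, 2f0 * x, 3f0 * x, 4f0 * x)
    return sum(v .* v)   # x^2 * (1 + 4 + 9 + 16)
end

# Hypothetical kernel: each thread builds its own SVector and writes one result.
function kernel!(out, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = thread_work(a[i])
    end
    return nothing
end

# Only launch when a usable GPU is present.
if CUDA.functional()
    a = CUDA.rand(Float32, 1024)
    out = CUDA.zeros(Float32, 1024)
    @cuda threads=256 blocks=4 kernel!(out, a)
    @assert Array(out) ≈ 30f0 .* Array(a) .^ 2
end
```

Because `thread_work` is plain Julia, it can also be checked on the CPU before it goes into a kernel.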
I am currently using static arrays. I’m still new to CUDA programming, so I’ll have to investigate this some more. Thanks.