Thread local storage is not implemented

It would be nice to have “thread local storage” implemented at some point. It looks like it might be necessary for recursive kernel calls. I’m not really sure as I’m still rather new to CUDA programming.

PTX’s local memory is backed by global memory, and is thus slow. Why not use StaticArrays for fast thread-local memory? Is there a particular reason you need the former?


I am currently using static arrays. I’m still new to CUDA programming, so I’ll have to investigate this some more. Thanks.

Hi @maleadt, is it guaranteed that a static array is backed by thread-local registers in CUDA.jl? NVIDIA’s documentation says:

Local memory is so named because its scope is local to the thread, not because of its physical location. In fact, local memory is off-chip. Hence, access to local memory is as expensive as access to global memory.

And also

Automatic variables that are likely to be placed in local memory are large structures or arrays that would consume too much register space and arrays that the compiler determines may be indexed dynamically.

What does “the compiler determines may be indexed dynamically” refer to in Julia? Let’s consider the following example.

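using CUDA, StaticArrays

function kernel()
    sa = SA_F32[1, 2, 3, 4, 5]
    s = 0.0f0                    # Float32 zero literal
    for i in eachindex(sa)       # i is a runtime loop index
        s += sa[i]
    end
    @cuprintf("s = %f\n", s)
    return nothing               # kernels must return nothing
end

@cuda kernel()
synchronize()                    # wait for the kernel so the printed output appears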

Is the static array sa dynamically indexed above (since i is not a constant)?

Also, how can we inspect the code generated by CUDA.jl to confirm whether the thread-local arrays end up in registers or in local memory?

StaticArrays are implemented as structs containing tuples, so it’ll be using registers and not PTX’s local memory. You can inspect generated code using @device_code_ptx.
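For anyone who wants to check this on their own kernel, here is a minimal sketch of that inspection. It assumes a working CUDA.jl setup with a GPU, and it assumes that spilled or stack-allocated data shows up in the PTX as `.local` declarations (typically a `__local_depot` buffer). The `io=` option and the string search are just one way to do it; reading the printed PTX by eye works too.

```julia
using CUDA, StaticArrays

function kernel()
    sa = SA_F32[1, 2, 3, 4, 5]
    s = 0.0f0
    for i in eachindex(sa)
        s += sa[i]
    end
    @cuprintf("s = %f\n", s)
    return nothing
end

# Capture the PTX emitted while compiling this launch into a string
# instead of printing it to stdout.
buf = IOBuffer()
@device_code_ptx io=buf @cuda kernel()
synchronize()
ptx = String(take!(buf))

# Heuristic check (an assumption, not an official API): local memory
# appears in PTX as `.local` state-space declarations and accesses.
if occursin(".local", ptx)
    println("PTX mentions .local: something was placed in local memory")
else
    println("No .local in the PTX: the array stayed out of local memory")
end
```

Note that `@device_code_ptx` only reports kernels that are compiled while the expression runs, so run it in a fresh session or before the first launch of that kernel. PTX is also an intermediate form, so registers may still be spilled during final assembly; `@device_code_sass` shows the machine code if you need to go that far.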