Thread local storage is not implemented

barrettp · September 15, 2020, 7:11pm

It would be nice to have “thread local storage” implemented at some point. It looks like it might be necessary for recursive kernel calls. I’m not really sure as I’m still rather new to CUDA programming.

maleadt · September 17, 2020, 9:21am

PTX’s local memory is backed by global memory, and is thus slow. Why not use StaticArrays for fast thread-local memory? Is there a particular reason you need the former?

barrettp · September 19, 2020, 4:50pm

I am currently using static arrays. I’m still new to CUDA programming, so I’ll have to investigate this some more. Thanks.

Shuhua · February 21, 2021, 7:37am

Hi, @maleadt, is it guaranteed that a static array is backed by thread-local registers in CUDA.jl? NVIDIA’s documentation says:

Local memory is so named because its scope is local to the thread, not because of its physical location. In fact, local memory is off-chip. Hence, access to local memory is as expensive as access to global memory.

And also

Automatic variables that are likely to be placed in local memory are large structures or arrays that would consume too much register space and arrays that the compiler determines may be indexed dynamically.

What does “the compiler determines may be indexed dynamically” refer to in Julia? Let’s consider the following example.

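using CUDA, StaticArrays

function kernel()
    sa = SA_F32[1, 2, 3, 4, 5]
    s = 0.0f0  # Float32 zero (0.0f32 parses as 0.0 × 10^32, which is misleading)
    for i in eachindex(sa)
        s += sa[i]
    end
    @cuprintf("s = %f\n", s)
    return nothing
end

@cuda kernel()
synchronize()  # wait for the kernel so the printed output is flushed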

Is the static array sa dynamically indexed above (since i is not a constant)?

Besides, how can we inspect the code generated by CUDA.jl to confirm whether we are using registers or local memory for the thread-local arrays?

maleadt · February 22, 2021, 12:53pm

StaticArrays are implemented as structs containing tuples, so it’ll be using registers and not PTX’s local memory. You can inspect generated code using @device_code_ptx.
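
As a rough sketch of that inspection step (not an official recipe), the snippet below dumps the PTX for a variant of the kernel from the question and searches it for the `.local` state space, which is how PTX declares local memory. The kernel name `sum_kernel` and the helper `uses_local_memory` are made up for illustration and are not part of CUDA.jl. The kernel writes its result to an output array instead of printing, on the assumption that the argument buffer of a device-side printf may itself appear as local memory and obscure the result.

```julia
using CUDA, StaticArrays

# Variant of the kernel from the question: sums a small static array
# and stores the result instead of printing it.
function sum_kernel(out)
    sa = SA_F32[1, 2, 3, 4, 5]
    s = 0.0f0
    for i in eachindex(sa)
        s += sa[i]
    end
    out[1] = s
    return nothing
end

# Hypothetical helper (not a CUDA.jl function): PTX declares local memory
# with the `.local` state space, so we simply search the text for it.
uses_local_memory(ptx::AbstractString) = occursin(".local", ptx)

if CUDA.functional()
    out = CUDA.zeros(Float32, 1)

    # Print the PTX to stdout; launch=false compiles without running.
    @device_code_ptx @cuda launch=false sum_kernel(out)

    # Capture the same PTX as a string and check it programmatically.
    argtypes = Tuple{typeof(cudaconvert(out))}
    ptx = sprint(io -> CUDA.code_ptx(io, sum_kernel, argtypes))
    println("PTX declares local memory: ", uses_local_memory(ptx))
end
```

This only covers the PTX level. The final assembler can still spill registers to local memory afterwards, so a clean PTX listing is a good sign rather than a guarantee.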
