Trying to understand the use of shared memory on GPUs

Hi,

I’m trying to learn how to use shared memory on the GPU by following the very nice write-up in

https://jenni-westoby.github.io/Julia_GPU_examples/dev/Vector_dot_product/

however, I fail to understand basic things here. I’m referring to the Vector Dot Product example.

I have several questions, but maybe some of them will answer themselves once the earlier ones are answered, so I won’t ask them all in advance :slight_smile:

So the first question is: when you run the code as suggested

@cuda blocks = blocksPerGrid threads = threadsPerBlock shmem = (threadsPerBlock * sizeof(Int64)) dot(a, b, c, N, threadsPerBlock, blocksPerGrid)

one must understand that all blocks, and all threads within each block, are executed in parallel, right? So the function dot(a, b, c, N, threadsPerBlock, blocksPerGrid)
is meant to be thought of as what happens to one generic thread in one generic block. Am I right or wrong?

I ask this very basic thing because, not understanding well what happens, the code seems (to my dumb eyes) to be mixing things. For instance, in the example you read

function dot(a,b,c, N, threadsPerBlock, blocksPerGrid)

    # Set up shared memory cache for this current block.
    cache = @cuDynamicSharedMem(Int64, threadsPerBlock)

and to my eyes this cache variable only makes sense if I consider the whole block as being processed inside the function. Otherwise, if I consider this function as what happens to a certain thread in a certain block, it would seem to be initializing an array for the whole block once per thread.

Now, related to this, should I understand that this cache variable stores one array in every block?

On the other hand, if I’m right and I should read the whole function as what happens to a single thread in a single block, then why does it declare cache as an array for the whole block?

As you can see, I’m in a mess here :frowning: I’m 100% sure I’m not understanding something quite basic, so I’d appreciate it if somebody could shed some light here :slight_smile:

Thanks for your patience,

Ferran.


The programming model is scalar, so the kernel function is evaluated for each thread you launch. However, shared memory is implicitly shared between the threads in each block you launch. For example, if you launch threads=5 blocks=10 you’ll end up with 50 threads in total, where each group of 5 threads in a block shares the cache variable you’d allocate.
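
In case a concrete toy example helps, here is a minimal sketch (the kernel and variable names are mine, not the tutorial’s) using the same @cuDynamicSharedMem macro as the tutorial; newer CUDA.jl releases spell it CuDynamicSharedArray. With threads=5 and blocks=10 the kernel body runs 50 times, but each block gets exactly one cache buffer, so after sync_threads() a thread can read a slot that a different thread of the same block wrote:

using CUDA

function neighbours(out)
    # One shared buffer per block; every thread of the block sees the same one.
    cache = @cuDynamicSharedMem(Int64, blockDim().x)

    tid = threadIdx().x
    cache[tid] = Int64(tid) * 10          # each thread fills its own slot

    sync_threads()                        # wait until all threads in this block have written

    nxt = (tid % blockDim().x) + 1        # slot filled by the next thread
    out[tid, blockIdx().x] = cache[nxt]   # read a value written by another thread
    return nothing
end

out = CUDA.zeros(Int64, 5, 10)
@cuda blocks = 10 threads = 5 shmem = (5 * sizeof(Int64)) neighbours(out)
Array(out)    # every column (one per block) contains 20, 30, 40, 50, 10

If cache were private to each thread, no thread would see the values the others stored; because it is block-local, each thread sees what its neighbour wrote.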

This isn’t unique to CUDA.jl, so google CUDA shared memory if that’s still not clear; there’s a plethora of resources out there :slightly_smiling_face:


Hi again,
sorry for the delay, too much work :frowning:
Yes, I understand, and you confirm it works the way I thought it did, good. My question comes from things like

function dot(a,b,c, N, threadsPerBlock, blocksPerGrid)

    # Set up shared memory cache for this current block.
    cache = @cuDynamicSharedMem(Int64, threadsPerBlock)

so if it is a scalar model that executes for each launched thread, I do not really understand the cache line. To me it looks as if it is reserving memory for all the threads in a block, but doing so from every thread of that same block. So is this reserving memory threadsPerBlock times in the same block?
Either I’m not understanding it, or the syntax is confusing, or both :slight_smile:
Best,
Ferran.

Again, shared memory is implicitly shared across the threads in a block. There’s no way to express that in the syntax when the kernel is written in a purely scalar style: the cache line does not allocate a new array per thread, it just gives every thread a reference to the same per-block buffer whose size you reserved with the shmem launch argument. The same applies to CUDA C, so google CUDA shared memory for more details.
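
If it helps to see that spelled out, below is a hedged sketch in the spirit of the tutorial’s kernel (the names dot_kernel and partial are mine, and the per-block reduction is done by a single thread instead of the tutorial’s tree reduction). Every thread in a block executes the cache = ... line, yet only one buffer per block exists; its storage was reserved by the shmem launch argument, and the macro merely hands each thread a reference to it:

using CUDA

function dot_kernel(a, b, partial, N)
    # Same buffer for the whole block, sized to one Int64 per thread.
    cache = @cuDynamicSharedMem(Int64, blockDim().x)

    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    acc = Int64(0)
    while i <= N                          # grid-stride loop over the input
        acc += a[i] * b[i]
        i += blockDim().x * gridDim().x
    end
    cache[threadIdx().x] = acc            # each thread writes its own slot

    sync_threads()                        # everyone in the block has written

    if threadIdx().x == 1                 # one thread sums the block's slots
        s = Int64(0)
        for j in 1:blockDim().x
            s += cache[j]
        end
        partial[blockIdx().x] = s         # one partial result per block
    end
    return nothing
end

N = 1024
a = CUDA.fill(Int64(1), N)
b = CUDA.fill(Int64(2), N)
partial = CUDA.zeros(Int64, 4)
@cuda blocks = 4 threads = 256 shmem = (256 * sizeof(Int64)) dot_kernel(a, b, partial, N)
sum(Array(partial))    # == 2 * N; the per-block partials are summed on the CPU

So cache is declared once in the source and executed by every thread, but it is backed by a single allocation per block.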