Trying to understand the use of shared memory on GPUs


I’m trying to learn how to use shared memory on the GPU by following the very nice write-up in

However, I fail to understand some basic things. I’m referring to the Vector Dot Product example.

I have several questions, but maybe some of them will be answered along with the previous ones, so I won’t ask them all in advance :slight_smile:

So the first question is: when you run the code as suggested

@cuda blocks=blocksPerGrid threads=threadsPerBlock shmem=(threadsPerBlock * sizeof(Int64)) dot(a, b, c, N, threadsPerBlock, blocksPerGrid)

one must understand that all blocks, and all threads within each block, will be executed in parallel, right? So the function dot(a, b, c, N, threadsPerBlock, blocksPerGrid)
is meant to describe what happens to one generic thread in one generic block. Am I right or wrong?

I ask this very basic thing because, not understanding well what happens, the code seems (to my dumb eyes) to be mixing things. For instance, in the example you read

function dot(a,b,c, N, threadsPerBlock, blocksPerGrid)

    # Set up shared memory cache for this current block.
    cache = @cuDynamicSharedMem(Int64, threadsPerBlock)

and to my eyes this cache variable only makes sense if I consider a whole block as being processed inside the function. Otherwise, if I consider this function as what happens to a single thread in a single block, it would be initializing an array for the block once per thread.

Now, related to this: should I understand that this cache variable stores one array in every block?

On the other hand, if I’m right and I should read the whole function as what happens to a single thread in a single block, then why does it declare cache as an array for the whole block?

As you can see, I’m in a mess here :frowning: I’m 100% sure I’m not understanding something quite basic, so I’ll appreciate it if somebody can shed some light here :slight_smile:

Thanks for your patience,



The programming model is scalar, so the kernel function is evaluated for each thread you launch. However, shared memory is implicitly shared between the threads in each block you launch. For example, if you launch threads=5 blocks=10 you’ll end up with 50 threads in total, where each group of 5 threads in a block shares the cache variable you’d allocate.
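To make this concrete, here is an annotated sketch of a dot-product kernel in the style of the tutorial’s example (reconstructed from the snippets quoted above, so details may differ from the actual write-up). Every thread runs the same code, but the `cache` line gives each *block* one shared array, and each thread only writes its own slot of it:

```julia
# Sketch of the tutorial-style dot kernel, annotated from one thread's view.
function dot(a, b, c, N, threadsPerBlock, blocksPerGrid)
    # Executed by every thread, but this does NOT allocate per thread:
    # it hands each thread a view of the ONE shared array its block owns.
    cache = @cuDynamicSharedMem(Int64, threadsPerBlock)

    tid = threadIdx().x                          # this thread's slot, 1..threadsPerBlock
    i   = (blockIdx().x - 1) * blockDim().x + tid

    # Each thread accumulates its own grid-strided partial product...
    tmp = Int64(0)
    while i <= N
        tmp += a[i] * b[i]
        i   += blockDim().x * gridDim().x
    end
    # ...and writes it into ITS slot of the block's shared cache.
    cache[tid] = tmp

    sync_threads()        # wait until every thread in the block has written

    # Tree reduction within the block, operating on the shared cache.
    step = blockDim().x ÷ 2
    while step != 0
        if tid <= step
            cache[tid] += cache[tid + step]
        end
        sync_threads()
        step ÷= 2
    end

    # One thread per block writes out that block's partial sum.
    if tid == 1
        c[blockIdx().x] = cache[1]
    end
    return nothing
end
```

So the function is indeed “what one generic thread does”, yet `cache` is sized for the whole block: each thread touches `cache[tid]`, and after `sync_threads()` every thread of the block can see what its siblings wrote.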

This isn’t unique to CUDA.jl, so google CUDA shared memory if that’s still not clear, there’s a plethora of resources out there :slightly_smiling_face:


Hi again,
sorry for the delay, too much work :frowning:
Yes, I understand, and you confirm it works the way I understood it, good. My question comes from things like

function dot(a,b,c, N, threadsPerBlock, blocksPerGrid)

    # Set up shared memory cache for this current block.
    cache = @cuDynamicSharedMem(Int64, threadsPerBlock)

so if it is a scalar model where the code executes on each launched thread, I don’t really understand the cache line. To me it looks as if it is reserving memory for all threads in a block, but from every thread in that same block. So is this reserving memory threadsPerBlock times in the same block?
Either I’m not understanding it, or the syntax is confusing, or both :slight_smile:

Again, shared memory is implicitly shared across the threads in a block. There’s no way to express this in the syntax if the code is completely scalar: the allocation happens once per block, not once per thread, even though every thread executes that line. The same applies to CUDA C, so google CUDA shared memory for more details.
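A tiny self-contained kernel (illustrative only, names are made up) can show that the line executes in every thread but reserves memory only once per block. Each thread fills just its own slot, yet thread 1 can read all of them afterwards:

```julia
using CUDA  # assumes CUDA.jl is installed and a GPU is available

function shmem_demo(out)
    # Runs in all threads; yields ONE shared array per block, not one per thread.
    cache = @cuDynamicSharedMem(Int64, blockDim().x)
    cache[threadIdx().x] = threadIdx().x   # each thread writes only its slot
    sync_threads()
    if threadIdx().x == 1                  # thread 1 can now see ALL slots
        s = Int64(0)
        for j in 1:blockDim().x
            s += cache[j]                  # sums what the other threads wrote
        end
        out[blockIdx().x] = s
    end
    return nothing
end

# out = CUDA.zeros(Int64, 10)
# @cuda threads=5 blocks=10 shmem=5 * sizeof(Int64) shmem_demo(out)
# Every entry of Array(out) is 1+2+3+4+5 = 15: each of the 10 blocks had
# its own 5-element cache, filled cooperatively by its own 5 threads.
```

If the macro really reserved memory threadsPerBlock times, thread 1 could never see the values the other threads wrote; the fact that the sum works is exactly the “implicitly shared” behaviour described above.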