I am going through the motions of understanding this. In the meantime, I hope someone could provide other examples of how shared memory is used in CUDA.jl.
What I really want to see is how shared memory can be used to return a vector of values instead of a single scalar at the end - in general, all examples are appreciated.
Shared memory is only temporary and is only accessible by the threads within a single block. This is useful when performing reductions, where the threads in a block need access to some common memory.
If you want a vector to be "returned" by the kernel, you need to allocate that memory (e.g. a CuArray) before the kernel is launched and then have the kernel write into it, so that it holds the output you want by the time the kernel finishes.
When the kernel finishes executing, all of the shared memory it allocated is cleaned up.
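For instance, a minimal sketch along those lines (the kernel name block_sum_kernel! and the arrays xs and block_sums are just names made up for illustration) could look like the following: each block reduces its slice of the input in shared memory and writes one partial sum into a preallocated output vector, so the kernel effectively "returns" a vector with one value per block.

using CUDA

# Each block sums its own slice of `xs` in shared memory and writes one
# partial sum per block into the preallocated `block_sums` vector.
# Assumes blockDim().x == 256 and a power-of-two block size (for the tree reduction).
function block_sum_kernel!(block_sums, xs)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    shared = CUDA.@cuStaticSharedMem(Float32, 256)

    # Load one element per thread into shared memory (zero-pad out of range).
    shared[tid] = i <= length(xs) ? xs[i] : 0.0f0
    CUDA.sync_threads()

    # Tree reduction within the block.
    step = blockDim().x ÷ 2
    while step != 0
        if tid <= step
            shared[tid] += shared[tid + step]
        end
        CUDA.sync_threads()
        step ÷= 2
    end

    # One thread writes this block's result to global memory.
    if tid == 1
        block_sums[blockIdx().x] = shared[1]
    end
    return nothing
end

xs = CUDA.rand(Float32, 4096)
threads = 256
blocks = cld(length(xs), threads)
block_sums = CUDA.zeros(Float32, blocks)   # preallocated output vector
@cuda threads=threads blocks=blocks block_sum_kernel!(block_sums, xs)
# block_sums now holds one partial sum per block; sum(block_sums) ≈ sum(xs)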
Here is an (unfortunately rather long) example I had lying around that uses shared memory to calculate π with a Monte Carlo method; you can look through it to see how the shared memory is used:
using CUDA
using BenchmarkTools
function _est_pi_gpu_fast!(totals, n)
    # Per-block scratch space: one counter per thread in the block.
    block_counts = CUDA.@cuStaticSharedMem(UInt16, 256)
    idx = threadIdx().x

    # Each thread throws `n` darts and counts how many land inside the unit circle.
    hit_count = UInt16(0)
    for _ in 1:n
        x = rand(Float32)
        y = rand(Float32)
        is_hit = (x * x + y * y <= 1)
        hit_count += UInt16(is_hit)
    end
    block_counts[idx] = hit_count

    # Tree reduction in shared memory: halve the active range each step
    # until block_counts[1] holds the block's total hit count.
    step = blockDim().x ÷ 2
    while step != 0
        CUDA.sync_threads()
        if idx <= step
            block_counts[idx] += block_counts[idx + step]
        end
        step ÷= 2
    end

    # One thread per block adds the block total to the global accumulator.
    if idx == 1
        CUDA.@atomic totals[1] += block_counts[1]
    end
    nothing
end
function est_pi_gpu_fast(n)
    threads = 256
    n_per_thread = 16
    blocks = cld(n, threads * n_per_thread)
    total_num = threads * n_per_thread * blocks   # actual number of samples drawn

    # Preallocate the output array the kernel writes into.
    total = CuArray{UInt32}([0])
    @cuda threads=threads blocks=blocks _est_pi_gpu_fast!(total, n_per_thread)

    # Copy the single result back to the host.
    gpu_total = UInt32(0)
    CUDA.@allowscalar begin
        gpu_total = total[]
    end
    CUDA.unsafe_free!(total)

    return 4 * gpu_total / total_num
end
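Calling it then looks like this (the sample count below is an arbitrary choice); the estimate should land close to π, and BenchmarkTools, loaded above, lets you check the timing:

est_pi_gpu_fast(10_000_000)            # estimate of π from roughly 10^7 samples
@btime est_pi_gpu_fast(10_000_000)     # timing via BenchmarkTools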
Perhaps you are right that shared memory is not what I need. My problem is at a more conceptual level, as follows:
I am programming a cell-linked list to find the nearest neighbours of a set of 2D/3D points. The algorithm is working on the GPU and, as far as I can see right now, performing surprisingly well. The issue is that when I need to update the preallocated arrays with the calculated data, I have to use CUDA.@atomic operations to lock the array, and this slows the code down by a factor of 100-1000x.
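Roughly, the update pattern I mean looks like the sketch below (heavily simplified, with made-up names like count_per_cell! and cell_ids rather than my actual code): binning points into the cells of the cell-linked list, where every insertion has to go through an atomic update on a preallocated array.

using CUDA

# Simplified illustration of the problematic write pattern: every point
# increments the count of its cell, and the atomic update serialises
# threads that hit the same cell.
function count_per_cell!(cell_counts, cell_ids)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(cell_ids)
        CUDA.@atomic cell_counts[cell_ids[i]] += Int32(1)
    end
    return nothing
end

n_points, n_cells = 100_000, 1_024
cell_ids = CuArray(rand(1:n_cells, n_points))   # precomputed cell index of each point
cell_counts = CUDA.zeros(Int32, n_cells)
@cuda threads=256 blocks=cld(n_points, 256) count_per_cell!(cell_counts, cell_ids)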
I hope to post a cleaner version of my current code on GitHub and open a new question with more details - perhaps someone will then be able to help me get it to the finish line.