sharedMemory in GPU programming examples

Hello!

I found this great ressource here:

https://jenni-westoby.github.io/Julia_GPU_examples/dev/Vector_dot_product/

And am going through the motion of understanding this. In the mean while I hope some could provide me other examples where sharedMemory in CUDA.jl is used.

What I really seek is to see how sharedMemory can be used to return a vector of values instead of a final scalar at the end - in general all examples are appreciated :slight_smile:

Kind regards

Sorry what exactly do you mean by “sharedMemory” returning a vector of values instead of a scalar?

In the dot product example of your resource, the macro in the line

cache = @cuDynamicSharedMem(Int64, threadsPerBlock)

effectively returns a vector of length threadsPerBlock that points to a statically allocated region of shared memory. Is this not what you mean?

1 Like

This shared memory is only temporary and is only accessible by the same threads in a single block. This is useful when performing reductions where the threads need access to some shared memory.

If you want a vector to be “returned” by the kernel, you need to allocate this memory before the kernel is launched and then have the memory written into by the kernel so that it has the output you want by the end of the kernel.

When the kernel ends execution, all the shared memory allocated gets cleaned up.

This is an (unfortunately longer) example of using shared memory to calculate \pi with a Monte Carlo method that I had lying around, which you can look at to understand the shared memory:

using CUDA
using BenchmarkTools
function _est_pi_gpu_fast!(totals, n)
    block_counts = CUDA.@cuStaticSharedMem(UInt16, 256)
    idx = threadIdx().x
    hit_count = UInt16(0)
    for _ in 1:n
        x = rand(Float32)
        y = rand(Float32)
        is_hit = (x*x+y*y<=1)

        hit_count += UInt16(is_hit)
    end
    block_counts[idx] = hit_count

    step = blockDim().x ÷ 2
    while (step != 0)
        CUDA.sync_threads()
        if idx <= step
            block_counts[idx] += block_counts[idx + step]
        end
        step ÷= 2
    end

    if idx == 1
        CUDA.@atomic totals[1] += block_counts[1]
    end
    
    nothing
end

function est_pi_gpu_fast(n)
    threads = 256
    n_per_thread = 16
    blocks = cld(n, threads*n_per_thread)
    total_num = threads * n_per_thread * blocks
    total = CuArray{UInt32}([0])
    @cuda threads=threads blocks=blocks _est_pi_gpu_fast!(total, n_per_thread)
    gpu_total = UInt32(0)
    CUDA.@allowscalar begin
        gpu_total = total[]
    end
    CUDA.unsafe_free!(total)
    return 4 * gpu_total / total_num
end
1 Like

Thank you both @loonatick-src and @jmair

Perhaps you are right that sharedMemory is not what I need. My problem is at a theoretical level as such;

I am programming a cell-linked list to find the nearest neighbours of a set of 2/3D points. The algorithm is working on GPU and as far as I can see right now, performing surprisingly well. The issue is that when I need to update the preallocated arrays with the calculate data, then I need to use @cuda_atomic operations, to lock the array - this slows down the code by a factor of x100-x1000.

I hope to be able to post a cleaner version on my current code on github and produce a new question with more details - perhaps someone is able to help me get it to the finish line then :slight_smile:

Kind regards

1 Like