And am going through the motion of understanding this. In the mean while I hope some could provide me other examples where sharedMemory in CUDA.jl is used.
What I really seek is to see how sharedMemory can be used to return a vector of values instead of a final scalar at the end - in general all examples are appreciated
This shared memory is only temporary and is only accessible by the same threads in a single block. This is useful when performing reductions where the threads need access to some shared memory.
If you want a vector to be “returned” by the kernel, you need to allocate this memory before the kernel is launched and then have the memory written into by the kernel so that it has the output you want by the end of the kernel.
When the kernel ends execution, all the shared memory allocated gets cleaned up.
This is an (unfortunately longer) example of using shared memory to calculate \pi with a Monte Carlo method that I had lying around, which you can look at to understand the shared memory:
using CUDA
using BenchmarkTools
function _est_pi_gpu_fast!(totals, n)
block_counts = CUDA.@cuStaticSharedMem(UInt16, 256)
idx = threadIdx().x
hit_count = UInt16(0)
for _ in 1:n
x = rand(Float32)
y = rand(Float32)
is_hit = (x*x+y*y<=1)
hit_count += UInt16(is_hit)
end
block_counts[idx] = hit_count
step = blockDim().x ÷ 2
while (step != 0)
CUDA.sync_threads()
if idx <= step
block_counts[idx] += block_counts[idx + step]
end
step ÷= 2
end
if idx == 1
CUDA.@atomic totals[1] += block_counts[1]
end
nothing
end
function est_pi_gpu_fast(n)
threads = 256
n_per_thread = 16
blocks = cld(n, threads*n_per_thread)
total_num = threads * n_per_thread * blocks
total = CuArray{UInt32}([0])
@cuda threads=threads blocks=blocks _est_pi_gpu_fast!(total, n_per_thread)
gpu_total = UInt32(0)
CUDA.@allowscalar begin
gpu_total = total[]
end
CUDA.unsafe_free!(total)
return 4 * gpu_total / total_num
end
Perhaps you are right that sharedMemory is not what I need. My problem is at a theoretical level as such;
I am programming a cell-linked list to find the nearest neighbours of a set of 2/3D points. The algorithm is working on GPU and as far as I can see right now, performing surprisingly well. The issue is that when I need to update the preallocated arrays with the calculate data, then I need to use @cuda_atomic operations, to lock the array - this slows down the code by a factor of x100-x1000.
I hope to be able to post a cleaner version on my current code on github and produce a new question with more details - perhaps someone is able to help me get it to the finish line then