I want to write a kernel that filters the values in input arrays of predefined length based on some condition, moving the “good” values (those which fulfill the condition) to the beginning of the array. I store the number of “good” values in a one-element CuArray, since I need to be able to modify it inside kernels.
```julia
using CUDA

N = 69_000
L = 1_000_000
n = cu([N])
arr = vcat(CUDA.rand(N), CUDA.zeros(L - N));
```
So I have a 1M-element CuArray in which the first 69k values are the values of interest (to be checked).
As I would like to do this filtering in place, my idea is to stage all the values of interest in a shared-memory cache and then write back to global memory only those which fulfill the condition.
```julia
function kernel(x, n)
    tid = threadIdx().x
    gid = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    cache = @cuDynamicSharedMem(Float32, blockDim().x)

    # read the number of values, then reset the counter
    N = n[1]
    sync_threads()
    if gid == 1
        n[1] = 0
    end

    # copy values to the shared-memory cache
    if gid <= N
        cache[tid] = x[gid]
    end
    sync_threads()

    # write back the values which fulfill the condition, reserving
    # a slot via an atomic increment of the counter
    if (gid <= N) && (cache[tid] > 0.5)
        idx = CUDA.atomic_add!(pointer(n, 1), Int64(1)) + 1
        x[idx] = cache[tid]
    end
    return nothing
end
```
Let’s run it:
```julia
# save the correct solution for testing
test = arr[findall(x -> x > 0.5, arr[1:N])];
ntest = sum(x -> x > 0.5, arr[1:N]);

# run the kernel
mykernel = @cuda name="boundary_kernel" launch=false kernel(arr, n)
config = launch_configuration(mykernel.fun)
threads = Base.min(N, config.threads)
blocks = cld(N, threads)
shmem = threads * sizeof(Float32)
mykernel(arr, n; threads=threads, blocks=blocks, shmem=shmem)
CUDA.synchronize()
```
```julia
CUDA.@allowscalar n[1] == ntest                         # true
CUDA.@allowscalar all(x -> x > 0.5, view(arr, 1:n[1]))  # true
CUDA.@allowscalar Set(test) == Set(arr[1:n[1]])         # true
```
Everything works for 69k values, and even beyond, until we exceed 1024 * 68 = 69632 values. Above that, the number of “good” values (stored in `n`) stops incrementing, and the other tests fail as well. I have tried to spot the problem, but I think I am missing some knowledge: exceeding 69632 values means launching 69 blocks of 1024 threads, so the number of blocks exceeds the number of SMs on my GPU (an RTX 2080 Ti has 68). Maybe I cannot do it this way, or some additional synchronization is needed?
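(For reference, this is how I read off the SM count; I assume CUDA.jl's device-attribute API here.)

```julia
# query the number of streaming multiprocessors on the current device
sm = CUDA.attribute(CUDA.device(), CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)
# sm == 68 on my RTX 2080 Ti
```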
Or maybe there is a better way to achieve my goal? It seems like a quite common problem (essentially stream compaction), so maybe there is an established pattern for solving it?
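The only workaround I see so far is to give up on in-place filtering and compact into a separate output buffer with a plain global atomic counter. A minimal sketch of what I mean (`compact_kernel`, `out`, and `n2` are just illustrative names):

```julia
# Sketch: compaction into a separate output buffer. Each thread that finds a
# "good" value atomically reserves a slot in `out`, so no block can overwrite
# input that another block still has to read. The order of the surviving
# values is not preserved (which is fine for the Set-based test above).
function compact_kernel(out, x, n, N)
    gid = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if gid <= N && x[gid] > 0.5
        idx = CUDA.atomic_add!(pointer(n, 1), Int64(1)) + 1
        out[idx] = x[gid]
    end
    return nothing
end

out = CUDA.zeros(Float32, L)
n2 = cu([0])
@cuda threads=1024 blocks=cld(N, 1024) compact_kernel(out, arr, n2, N)
```

That sidesteps the cross-block hazard, but it needs a second buffer, which is exactly what I wanted to avoid by working in place.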