Hi,
I want to write a kernel that filters values in input arrays of predefined length based on some condition, moving the “good” values (those which fulfill the condition) to the beginning of the array. I store the number of “good” values in a one-element CuArray, as I need to be able to modify it inside kernels.
Simplified example:
using CUDA

N = 69_000
L = 1_000_000
n = cu([N])
arr = vcat(CUDA.rand(N), CUDA.zeros(L-N));
So I have a 1M-element CuArray in which the first 69k values are the values of interest (to be checked against the condition).
As I would like to do this filtering in-place, I came up with the idea of staging all the values of interest in a shared-memory cache and then writing back to global memory only those which fulfill the condition.
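For reference, this is the behavior I am after, expressed on the CPU (a plain-Julia sketch; cpu_compact! is just an illustrative name):

function cpu_compact!(x, n)
    N = n[1]              # number of values to check
    k = 0                 # count of "good" values found so far
    for i in 1:N
        if x[i] > 0.5
            k += 1
            x[k] = x[i]   # move the good value towards the front
        end
    end
    n[1] = k              # store the new count
    return x
end

On the CPU this is trivially safe, because k never runs ahead of i.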
function kernel(x, n)
    tid = threadIdx().x
    gid = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    cache = @cuDynamicSharedMem(Float32, blockDim().x)
    # copy the number of values
    N = n[1]
    sync_threads()
    # reset the counter
    if gid == 1
        n[1] = 0
    end
    # copy values to the shared-memory cache
    if gid <= N
        cache[tid] = x[gid]
    end
    sync_threads()
    # write back the values which fulfill the condition
    if (gid <= N) && (cache[tid] > 0.5)
        idx = CUDA.atomic_add!(pointer(n, 1), Int64(1)) + 1
        x[idx] = cache[tid]
    end
    return nothing
end
Let’s run it:
# save correct solution for testing
test = arr[findall(x->x>0.5, arr[1:N])];
ntest = sum(x->x>0.5, arr[1:N]);
# run kernel
mykernel = @cuda name="boundary_kernel" launch=false kernel(arr,n)
config = launch_configuration(mykernel.fun)
threads = Base.min(N, config.threads)
blocks = cld(N, threads)
shmem = threads*sizeof(Float32)
mykernel(arr,n; threads=threads, blocks=blocks, shmem=shmem)
CUDA.synchronize()
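On my GPU, config.threads comes out as 1024, so for N = 69_000 this launches cld(69_000, 1024) = 68 blocks of 1024 threads, i.e. 1024*68 = 69632 thread slots in total.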
Tests:
CUDA.@allowscalar n[1] == ntest # true
CUDA.@allowscalar all(x->x>0.5, view(arr,1:n[1])) # true
CUDA.@allowscalar Set(test) == Set(arr[1:n[1]]) # true
Everything works for 69k values, and even beyond, until we exceed 1024*68 = 69632 values. Above that, the number of “good” values (stored in n) stops incrementing correctly, and the other tests fail as well. I have tried to spot the problem, but I think I am missing some knowledge: exceeding 69632 values means launching 1024 threads times 69 blocks, so the number of blocks exceeds the number of SMs on my GPU (an RTX 2080 Ti, which has 68 SMs). Maybe I cannot do it this way, or is some additional synchronization needed?
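A minimal way to reproduce the failure, assuming the same setup and launch code as above (only N changes):

# one value more than 68 full blocks of 1024 threads -> 69 blocks launched
N = 69_633
n = cu([N])
arr = vcat(CUDA.rand(N), CUDA.zeros(L - N));
# ... re-run the kernel launch and tests above; n[1] no longer matches ntest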
Or maybe there is a better way to achieve my goal? It seems to me like a fairly common problem (stream compaction), so maybe there is an established pattern for solving it?
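For comparison, here is a minimal out-of-place sketch of what I am trying to avoid; it allocates temporaries instead of compacting arr in place (I am assuming here that logical indexing on a CuArray works, via findall):

head = arr[1:N]                           # copies the first N values
good = head[head .> 0.5f0]                # allocates the filtered values
copyto!(view(arr, 1:length(good)), good)  # write them back to the front
CUDA.@allowscalar n[1] = length(good)     # update the count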