Kernel fails when number of blocks exceeds number of SMs (?)

Hi,

I want to write a kernel that filters the values of input arrays of predefined length based on some condition. I would also like to have the “good” values (those which fulfill the condition) at the beginning of these arrays. I store the number of “good” values in a one-element CuArray, as I need to be able to modify it inside kernels.
Simplified example:

N = 69_000
L = 1_000_000

n = cu([N])
arr = vcat(CUDA.rand(N), CUDA.zeros(L-N));

So I have a 1M-element CuArray in which the first 69k values are the values of interest (to be checked).
As I would like to do this filtering in place, I came up with the idea of storing all the values of interest in shared memory and then writing back to global memory only those which fulfill the condition.

function kernel(x, n)
    tid = threadIdx().x
    gid = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    cache = @cuDynamicSharedMem(Float32, blockDim().x)
    
    # copy number of values
    N = n[1]
    
    sync_threads()
    
    if gid == 1
        n[1] = 0
    end
    
    # copy values to the cache
    if gid <= N
        cache[tid] = x[gid]
    end
    
    sync_threads()
    
    # save values which fulfill condition
    if (gid <= N) && (cache[tid] > 0.5)
        idx = CUDA.atomic_add!(pointer(n, 1), Int64(1)) + 1
        x[idx] = cache[tid]
    end
            
    return nothing
end

Let’s run it:

# save correct solution for testing
test = arr[findall(x->x>0.5, arr[1:N])];
ntest = sum(x->x>0.5, arr[1:N]);

# run kernel
mykernel = @cuda name="boundary_kernel" launch=false kernel(arr, n)
config = launch_configuration(mykernel.fun)
threads = Base.min(N, config.threads)
blocks = cld(N, threads)
shmem = threads*sizeof(Float32)
mykernel(arr,n; threads=threads, blocks=blocks, shmem=shmem)
CUDA.synchronize()

Tests:

CUDA.@allowscalar n[1] == ntest # true
CUDA.@allowscalar all(x->x>0.5, view(arr,1:n[1])) # true
CUDA.@allowscalar Set(test) == Set(arr[1:n[1]]) # true

Everything works for 69k values, and even beyond, until we exceed 1024*68 = 69632 values. Above that, the number of “good” values (stored in n) stops incrementing, and the other tests fail as well. I tried to spot the problem, but I think I am missing something: exceeding 69632 values means launching 1024 threads in 69 blocks, so the number of blocks exceeds the number of SMs in my GPU (an RTX 2080 Ti has 68). Maybe I can’t do it this way, or is some additional synchronization needed?

Or maybe there is a better way to achieve my goal? It seems to me like a fairly common problem, so maybe there is an established pattern for solving it?

Blocks run in no particular order, and sync_threads only synchronizes the threads within a block (i.e. not across blocks). So you cannot safely read n[1] in one block while another block resets it or atomic_adds to it, and nothing stops one block from overwriting positions of x that another block has not even read yet.

That said, findall and sum are available for CuArray, so can’t you just use those like you do for the Array implementation?
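
For example, something along these lines (an untested sketch, reusing the 0.5 threshold and the 1:N range from your example; the gather on the right-hand side allocates a temporary):

good = findall(x -> x > 0.5, arr[1:N])   # indices of the “good” values
ngood = length(good)
arr[1:ngood] .= arr[good]                # compact them to the front of arr
CUDA.@allowscalar n[1] = ngood           # keep the device-side counter in sync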

I would need something similar to sort!(arr, by=x->x>0.5), which I was using before, but its performance was very bad compared to simply copying the “good” elements of the source array into another one. Something like this:

N = 69_000
L = 1_000_000

n1 = cu([N])
arr1 = vcat(CUDA.rand(N), CUDA.zeros(L-N));
n2 = cu([0])
arr2 = similar(arr1)

function kernel(arr1, arr2, n1, n2)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # grid-stride loop over the candidate values
    for k = index:stride:n1[1]
        if arr1[k] > 0.5
            # reserve a slot in the output array and copy the value over
            idx = CUDA.atomic_add!(pointer(n2, 1), Int64(1)) + 1
            arr2[idx] = arr1[k]
        end
    end
    return nothing
end
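
Launched roughly like this (the sizes here are just an example, and n2 has to be zeroed before every pass):

CUDA.@allowscalar n2[1] = 0   # reset the output counter before each pass
threads = 256                 # example launch sizes, not tuned
blocks = cld(N, threads)
@cuda threads=threads blocks=blocks kernel(arr1, arr2, n1, n2)
CUDA.synchronize()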

But this required copying the arrays back again, and I had to allocate twice the space. Another problem is that I need to perform this “sorting” operation on many arrays simultaneously, based on the values of one of them. Packing them all into something that sort! accepts was too much overhead, since I need to perform this operation in every iteration of my algorithm (and there can be tens of millions of iterations in a single simulation). That is also why I preallocate arrays of a safe length of 1_000_000: I cannot afford to allocate in every iteration while the number of values of interest keeps changing.
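
One thing I could do is avoid the copy-back by swapping the roles of the two preallocated buffers between iterations (still twice the memory, though), roughly:

arr1, arr2 = arr2, arr1   # the compacted output becomes the next iteration's input
n1, n2 = n2, n1           # and its counter becomes the new n1

and then reset n2 before the next launch, as above.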

I have seen sync_grid() in the documentation. So sync_grid(this_grid()) should work?

No, you generally can’t synchronize the grid. There’s a reason these blocks don’t get scheduled all together: they don’t all fit on your GPU at once. Grid-wide synchronization requires a cooperative launch in which every block is resident at the same time, which is impossible here. You could change your kernel to loop instead, so that it only needs a number of blocks (or even just one) that fits on your GPU.
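
For example, here is a rough, untested sketch of a single-block version (the name compact_kernel! and the launch sizes are mine, not anything from CUDA.jl): it processes the input in tiles of blockDim().x values, and because only one block runs, sync_threads() really does cover every participating thread, so the in-place writes can never overwrite values that have not been read yet. With more than one block you would be back to the original race.

function compact_kernel!(x, n)
    tid = threadIdx().x
    nthreads = blockDim().x
    cache = @cuDynamicSharedMem(Float32, nthreads)

    N = n[1]             # number of candidate values
    sync_threads()       # everyone has read N before it gets reset
    if tid == 1
        n[1] = 0         # safe: only this single block ever touches n
    end
    sync_threads()

    # walk over the candidates in tiles of blockDim().x elements
    offset = 0
    while offset < N
        gid = offset + tid
        if gid <= N
            cache[tid] = x[gid]
        end
        sync_threads()   # the whole tile is in shared memory before anyone writes to x
        if gid <= N && cache[tid] > 0.5
            idx = CUDA.atomic_add!(pointer(n, 1), Int64(1)) + 1
            x[idx] = cache[tid]
        end
        sync_threads()   # finish the tile before loading the next one
        offset += nthreads
    end
    return nothing
end

threads = 1024
shmem = threads * sizeof(Float32)
@cuda threads=threads blocks=1 shmem=shmem compact_kernel!(arr, n)
CUDA.synchronize()

Whether a single block is fast enough for your inner loop is something you would have to benchmark; the alternative is the two-buffer kernel you already have.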