GPU Synchronization Issue - using KernelAbstractions.jl

@synchronize is not working consistently in a function I am trying to write. The function is supposed to first divide every element of column 1 by its topmost element (3), then synchronize, then copy the second and third elements of column 1 into the second and third elements of columns 2 and 3.

About half the time I run it I get the correct answer (a matrix whose second column is 1 2 3); the other half of the time I get an incorrect answer (second column is 1 6 9). Column 3, however, is always copied correctly.

Does anyone know what is happening and how I can fix this? Thank you!


using CUDA, KernelAbstractions

A = CuArray([3.0 1.0 1.0; 6.0 4.0 2.0; 9.0 7.0 7.0])
backend = get_backend(A)

@kernel function test_gpu!(A)
    I, J = @index(Global, NTuple)

    # first phase: divide the lower elements of column 1 by A[1, 1]
    if I <= 2 && J == 1
        A[I+1, 1] = A[I+1, 1] / A[1, 1]
    end
    @synchronize

    # second phase: copy column 1 into columns 2 and 3 (rows 2 and 3)
    if I > 1 && I <= 3 && J > 1 && J <= 3
        A[I, J] = A[I, 1]
    end
end

test_gpu!(backend, 64)(A, ndrange = (3, 3))
A

(Also, happy Thanksgiving to everyone celebrating!)


KA.jl is launching multiple blocks here, and @synchronize (aka. CUDA.sync_threads()) only synchronizes threads within a block.
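If you do need a synchronization point that spans all workgroups, the usual pattern is to split the work into two kernels: launches on the same backend queue execute in order, so the first phase is guaranteed to finish before the second starts. A rough sketch of that approach, reusing the `A` and `backend` from the original post (untested, and the kernel names are just placeholders):

```julia
using KernelAbstractions

# phase 1: divide the lower elements of column 1 by A[1, 1]
@kernel function scale_col1!(A)
    I, J = @index(Global, NTuple)
    if I <= 2 && J == 1
        A[I+1, 1] = A[I+1, 1] / A[1, 1]
    end
end

# phase 2: copy column 1 into columns 2 and 3
@kernel function copy_cols!(A)
    I, J = @index(Global, NTuple)
    if I > 1 && I <= 3 && J > 1 && J <= 3
        A[I, J] = A[I, 1]
    end
end

# the two launches are ordered on the backend's queue, so no
# in-kernel @synchronize is needed between the phases
scale_col1!(backend, 64)(A, ndrange = (3, 3))
copy_cols!(backend, 64)(A, ndrange = (3, 3))
KernelAbstractions.synchronize(backend)  # wait for the GPU before reading A
```

This scales to arrays larger than one workgroup, unlike shrinking the launch to a single block.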


Thank you for the explanation! Do you know if there’s a way to make KA launch just one block at a time? I couldn’t find anything in the documentation.

If your workgroup size matches the ndrange, you will only get a single block. So pass (3,3) instead of 64.
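For reference, a minimal sketch of that fix, reusing `test_gpu!`, `backend`, and `A` from the original post:

```julia
# With the workgroup size equal to the ndrange, all 9 work-items land
# in a single workgroup, so @synchronize orders the two phases.
test_gpu!(backend, (3, 3))(A, ndrange = (3, 3))
```

Note this only works while the whole problem fits in one workgroup (bounded by the hardware's threads-per-block limit).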

This worked, thank you so much!

Hi, thanks for the question and answer above! Is the original code:

test_gpu!(backend, 64)(A, ndrange = (3,3))

launching 3 workgroups of size 3x1?