@synchronize causing silent failures in function

Hello! I’ve run into a strange issue. Running the following code in a Jupyter notebook hangs forever on the last line:

using KernelAbstractions
using CUDA

@kernel function tile_lu_factor!(A, n)
    
    I, J, K, L = @index(Global, NTuple)
    
    for k = 1:2    
        
        if K == 1 && L == 1
            for k = 1:n
        
                @synchronize

            end
        end
        
        if K == 1 && L <= 3-k
            for k = 1:n
        
                @synchronize

            end
        end
    end
end

A = CuArray(rand(2, 2))
backend = get_backend(A)

tile_lu_factor!(backend, (2, 2, 1, 3))(A, 2, ndrange = (2, 2, 1, 3))

A_not_gpu = Array(A)

However, when I replace n with 2 in the inner for loops, it finishes:

    for k = 1:2
        
        if K == 1 && L == 1
            for k = 1:2
        
                @synchronize

            end
        end
        
        if K == 1 && L <= 3-k
            for k = 1:2
        
                @synchronize

            end
        end
    end

What is going on?

I don’t know what the error is, but note that you use k as the index of both the inner loops and the outer loop. That should not cause any issues, since each inner loop introduces its own k, but it could confuse readers.

Could you explain why all the threads will hit the same number of @synchronize calls? Looking at the code, it seems that threads with K == 1 and lower L values will synchronize more often than the other threads, which means they would wait forever.
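To make the mismatch concrete, here is a small host-side sketch that counts how many barriers a thread with given K and L would reach in the first kernel. It runs on the CPU and does not touch the GPU; `barrier_count` is a hypothetical helper written for this illustration, not part of KernelAbstractions.

```julia
# Hypothetical helper: mirrors the control flow of the first kernel and
# counts barriers instead of executing them.
function barrier_count(K, L, n)
    count = 0
    for k = 1:2
        if K == 1 && L == 1
            count += n   # inner loop runs n times, one @synchronize each
        end
        if K == 1 && L <= 3 - k
            count += n   # same for the second inner loop
        end
    end
    return count
end

# With ndrange = (2, 2, 1, 3), K is always 1 and L ranges over 1:3.
counts = [barrier_count(1, L, 2) for L in 1:3]   # [8, 2, 0]
```

With n = 2, threads with L = 1 reach 8 barriers, threads with L = 2 reach 2, and threads with L = 3 reach none, so the threads in the workgroup do not agree on how many times to synchronize.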

Thanks for the suggestions. I changed the loop variable for clarity and added else branches so that every thread should synchronize the same number of times. However, it still gets stuck when I try to access A after running the function.

Here is the new function:

@kernel function tile_lu_factor!(A, n)
    
    I, J, K, L = @index(Global, NTuple)
    
    for s = 1:2    
        
        if K == 1 && L == 1
            for s = 1:n
        
                @synchronize

            end
        else 
            for s = 1:n
        
                @synchronize

            end
        end
        
        if K == 1 && L <= 3-s
            for s = 1:n
        
                @synchronize

            end
        else
            for s = 1:n
        
                @synchronize

            end
        end
    end
end

You’re not allowed to synchronize from a divergent context (i.e. from branches that are not taken uniformly across a warp); generally all threads have to reach the barrier, or you risk deadlocks.
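One way to follow this rule is to keep every @synchronize in control flow that all threads share, and to let the branches guard only the thread-specific work. The following is a minimal sketch under that assumption, not a tested fix for the actual LU code: the kernel name `tile_lu_factor_uniform!`, the inner index `t`, and the `counts` array are all hypothetical, and `counts` exists only to record how many barriers each thread passed.

```julia
using KernelAbstractions
using CUDA

# Hypothetical variant of the kernel above: the loops have the same trip
# count for every thread, and each barrier sits outside the if blocks.
@kernel function tile_lu_factor_uniform!(counts, n)
    I, J, K, L = @index(Global, NTuple)

    for s = 1:2
        for t = 1:n
            if K == 1 && L == 1
                # work for the selected threads would go here
            end
            @synchronize               # reached by every thread
            counts[I, J, K, L] += 1    # illustration only: one tick per barrier
        end

        for t = 1:n
            if K == 1 && L <= 3 - s
                # work for the selected threads would go here
            end
            @synchronize               # reached by every thread
            counts[I, J, K, L] += 1
        end
    end
end

n = 2
counts = CUDA.zeros(Int, 2, 2, 1, 3)
backend = get_backend(counts)

tile_lu_factor_uniform!(backend, (2, 2, 1, 3))(counts, n, ndrange = (2, 2, 1, 3))
KernelAbstractions.synchronize(backend)

# Every thread should have passed the same number of barriers, 4n in total.
counts_host = Array(counts)
```

Each thread writes only its own entry of `counts`, so the bookkeeping itself does not need extra synchronization. If the kernel finishes and all entries agree, the barriers were reached uniformly.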

Got it, thanks!