Notes on `CUDA.sync_threads` and dispatch on `Union`

I’ve been playing with Union-typed objects on CUDA GPUs and found an interesting quirk in how sync_threads and Union-splitting interact. I thought I’d share it here since it’s not really a CUDA.jl issue.


If CUDA.sync_threads is used inside a function whose arguments may have different types on different threads, the kernel deadlocks.

using CUDA

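function kernel()
    x = threadIdx().x
    # Odd threads carry a Float32, even threads an Int32, so x is Union-typed.
    if isodd(x)
        x = Float32(x)
    else
        x = Int32(x)
    end
    # Using foldl for, e.g., loop unrolling:
    x = Base.afoldl(x, 1, 2) do x, i
        @cushow x
        sync_threads()
        x += i
    end
    return
end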

We get a deadlock with multiple threads:

julia> CUDA.@sync @cuda threads=1 kernel();
x = 1.000000
x = 2.000000

julia> CUDA.@sync @cuda threads=2 kernel();
x = 1.000000
x = 2
^CERROR: InterruptException:
 [1] process_events
   @ ./libuv.jl:104 [inlined]
 [2] wait
   @ ./task.jl:765 [inlined]
 [3] yield()
   @ Base ./task.jl:657
 [4] synchronize(s::CuStream; blocking::Bool)
   @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/stream.jl:119
 [5] top-level scope
   @ ~/.julia/dev/CUDA/src/utilities.jl:29

julia> exit()
^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
unknown function (ip: 0x7f4d10f312b7)
unknown function (ip: 0x7f4d10db2dfe)
unknown function (ip: 0x7f4d10d98444)
unknown function (ip: 0x7f4d1100aad5)
unknown function (ip: 0x7f4d1100b5c0)
unknown function (ip: 0x7f4d10e1c1a6)
unknown function (ip: 0x7f4d10d987c6)
unknown function (ip: 0x7f4d10d73717)
cuModuleUnload at /usr/lib/x86_64-linux-gnu/ (unknown line)
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:272 [inlined]
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/error.jl:94 [inlined]
cuModuleUnload at /home/tkf/.julia/dev/CUDA/lib/utils/call.jl:26 [inlined]
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/context.jl:162 [inlined]
unsafe_unload! at /home/tkf/.julia/dev/CUDA/lib/cudadrv/module.jl:78
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
run_finalizer at /buildworker/worker/package_linux64/build/src/gc.c:278
jl_gc_run_finalizers_in_list at /buildworker/worker/package_linux64/build/src/gc.c:365
run_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:394 [inlined]
jl_gc_run_all_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:480
jl_atexit_hook at /buildworker/worker/package_linux64/build/src/init.c:240
repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:703
main at julia1.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/ (unknown line)
unknown function (ip: 0x4007d8)

According to section B.6, “Synchronization Functions”, of the CUDA Programming Guide (CUDA Toolkit Documentation):

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

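So, since a function call on a “Union-typed variable” is actually a conditional (the compiler splits it into one branch per concrete type, and each branch ends up with its own sync_threads call), the example above invokes this undefined behavior.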
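To make the "function call is actually a conditional" point concrete, here is a minimal CPU-only sketch. The names make_x, fold_step, and split_by_hand are hypothetical and exist only for this illustration; split_by_hand is a hand-written approximation of what union splitting does, not the compiler's actual output.

```julia
# CPU-only illustration of union splitting; no GPU is required.
# All function names here are hypothetical.

# Mirrors the branch at the top of the kernel: the result is a Float32 for
# odd inputs and an Int32 for even inputs.
function make_x(i)
    if isodd(i)
        return Float32(i)
    else
        return Int32(i)
    end
end

# Stands in for the body of the `do` block.
fold_step(x, i) = x + i

# Roughly what the compiler does for `fold_step(make_x(i), 1)`: branch on the
# runtime type, then call a method specialized for that type. On the GPU,
# a barrier inside `fold_step` would therefore sit in two different branches.
function split_by_hand(i)
    x = make_x(i)
    if x isa Float32
        return fold_step(x, 1)   # Float32 specialization
    else
        return fold_step(x, 1)   # Int32 specialization
    end
end

# Inference reports a small Union for the unstable helper.
inferred = only(Base.return_types(make_x, (Int,)))
```

Here inferred is Union{Float32, Int32}, which is the kind of value that gets split. Odd and even thread indices take different arms of the isa branch, so a barrier inside the called function would not be reached "identically across the entire thread block".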

Some random comments:

  • In general, I guess we should be very careful when using sync_threads inside a function, including do blocks. (Or just avoid it.)

  • sync_warp does not have this problem.

  • Are there any other cases where Union splitting can kick in due to some aggressive optimizations, even without the function boundaries?

  • Does anybody know why __syncthreads is execution-mask-aware? I find it strange. Does it just have to be this way due to some GPU hardware limitation?
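On the "be very careful" point, one rough way to check on the CPU whether a value headed into a function containing a barrier is Union-typed is to ask inference for its return type. This is only a sketch: unstable_init, stable_init, and has_concrete_return are hypothetical names, and CPU inference is at best a proxy for what the GPU compiler ends up doing.

```julia
# CPU-only check; no GPU is required. All names here are hypothetical.

# Type-unstable: Float32 for odd inputs, Int32 for even inputs.
unstable_init(i) = isodd(i) ? Float32(i) : Int32(i)

# Type-stable: every input maps to a Float32.
stable_init(i) = Float32(i)

# True when inference finds a single concrete return type for `f`.
has_concrete_return(f, argtypes) =
    isconcretetype(only(Base.return_types(f, argtypes)))

unstable_ok = has_concrete_return(unstable_init, (Int,))
stable_ok = has_concrete_return(stable_init, (Int,))
```

unstable_ok is false and stable_ok is true. Converting to one concrete type before the fold, as stable_init does, removes the type-dependent branch around the barrier.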

Actually, how sync_threads works seems to be platform-specific. On AMD GPUs, sync_workgroup calls on different branches can synchronize with each other:

julia> using AMDGPU

julia> xs = ROCVector{Int32}(undef, 2);

julia> function kernel!(xs)
           bool = isodd(workitemIdx().x)
           if bool
               sync_workgroup()
               xs[workitemIdx().x] = 1
           else
               sync_workgroup()
               xs[workitemIdx().x] = 2
           end
           return
       end
kernel! (generic function with 1 method)

julia> wait(@roc groupsize=2 kernel!(xs));

julia> xs
2-element ROCVector{Int32}:
 1
 2

(But since the ROCm documentation does not explicitly define how __syncthreads works with conditionals, I don’t know if this is well-defined code.)

Similar code in CUDA hangs:

julia> using CUDA

julia> function kernel!(xs)
           bool = isodd(threadIdx().x)
           if bool
               sync_threads()
               xs[threadIdx().x] = 1
           else
               sync_threads()
               xs[threadIdx().x] = 2
           end
           return
       end
kernel! (generic function with 1 method)

julia> xs = CUDA.zeros(2)
       CUDA.@sync @cuda threads=2 kernel!(xs)
^C^C^C^C^CWARNING: Force throwing a SIGINT

Nice MWE. Did you see “Synchronization in union-split code paths” (JuliaGPU/CUDA.jl issue #797) and “Implement sync_threads using an unaligned barrier” (JuliaGPU/CUDA.jl pull request #798, by maleadt)?

Oh, great! Sorry, I totally missed it. I didn’t notice that you were already working on this!