I’ve been playing with `Union`-typed objects on CUDA GPUs and found an interesting quirk in how union splitting interacts with thread synchronization. I thought I’d share it here since it’s not really a CUDA.jl issue.

When `CUDA.sync_threads` is used inside a function that may receive arguments of different types across threads, we get a deadlock:
```julia
using CUDA

function kernel()
    x = threadIdx().x
    if isodd(x)
        x = Float32(x)
    else
        x = Int32(x)
    end
    # Using foldl for, e.g., loop unrolling:
    x = Base.afoldl(x, 1, 2) do x, i
        @cushow x
        x += i
        sync_threads()
        x
    end
    return
end
```
We get a deadlock with multiple threads:
```
julia> CUDA.@sync @cuda threads=1 kernel();
x = 1.000000
x = 2.000000

julia> CUDA.@sync @cuda threads=2 kernel();
x = 1.000000
x = 2
^CERROR: InterruptException:
Stacktrace:
  process_events
    @ ./libuv.jl:104 [inlined]
  wait
    @ ./task.jl:765 [inlined]
  yield()
    @ Base ./task.jl:657
  synchronize(s::CuStream; blocking::Bool)
    @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/stream.jl:119
  top-level scope
    @ ~/.julia/dev/CUDA/src/utilities.jl:29

julia> exit()
^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT error in running finalizer: InterruptException()
unknown function (ip: 0x7f4d10f312b7)
unknown function (ip: 0x7f4d10db2dfe)
unknown function (ip: 0x7f4d10d98444)
unknown function (ip: 0x7f4d1100aad5)
unknown function (ip: 0x7f4d1100b5c0)
unknown function (ip: 0x7f4d10e1c1a6)
unknown function (ip: 0x7f4d10d987c6)
unknown function (ip: 0x7f4d10d73717)
cuModuleUnload at /usr/lib/x86_64-linux-gnu/libcuda.so.1 (unknown line)
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:272 [inlined]
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/error.jl:94 [inlined]
cuModuleUnload at /home/tkf/.julia/dev/CUDA/lib/utils/call.jl:26 [inlined]
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/context.jl:162 [inlined]
unsafe_unload! at /home/tkf/.julia/dev/CUDA/lib/cudadrv/module.jl:78
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
run_finalizer at /buildworker/worker/package_linux64/build/src/gc.c:278
jl_gc_run_finalizers_in_list at /buildworker/worker/package_linux64/build/src/gc.c:365
run_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:394 [inlined]
jl_gc_run_all_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:480
jl_atexit_hook at /buildworker/worker/package_linux64/build/src/init.c:240
repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:703
main at julia1.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4007d8)
```
As the CUDA programming guide notes:

> `__syncthreads()` is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
So, since a function call on a `Union`-typed variable is actually a conditional (dispatch compiles to a branch on the type tag), the example above invokes this undefined behavior.
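To make this concrete, here is a hand-written version of roughly what union splitting produces for the kernel above (a sketch of the control flow, not the actual lowered code; `split_kernel` is a name I made up). Each specialization carries its own `sync_threads()` call site, so odd and even threads wait at different barriers:

```julia
using CUDA

function split_kernel()
    x = threadIdx().x
    if isodd(x)            # the branch union splitting introduces
        y = Float32(x)
        @cushow y
        sync_threads()     # reached only by odd threads
    else
        y = Int32(x)
        @cushow y
        sync_threads()     # reached only by even threads
    end
    return
end
```

With `threads=2`, thread 1 blocks at the first barrier and thread 2 at the second, and neither can make progress: exactly the “conditional that does not evaluate identically across the thread block” case.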
Some random comments:
In general, I guess we should be very careful when using `sync_threads` inside a function, including `do` blocks. (Or just avoid it.)
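One defensive pattern (just a sketch; `safe_kernel` is a hypothetical name) is to convert to a concrete type before entering any code that contains a barrier, so every thread runs the same specialization:

```julia
using CUDA

function safe_kernel()
    x = threadIdx().x
    y = isodd(x) ? Float32(x) : Int32(x)  # still Union{Float32,Int32}
    z = Float32(y)                        # concrete again: no union split below
    z = Base.afoldl(z, 1, 2) do z, i
        @cushow z
        z += i
        sync_threads()                    # same call site for every thread
        z
    end
    return
end
```

Of course this only works when a common concrete type exists; otherwise the barrier has to move out of the type-unstable region entirely.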
`sync_warp` does not have this problem.
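For contrast, `sync_warp` takes an explicit participation mask, so divergent threads can still synchronize within their active subset. A sketch, assuming a single full warp of 32 threads (note odd `threadIdx` values sit on even lanes because Julia indexing is 1-based):

```julia
using CUDA

function warp_kernel()
    x = threadIdx().x
    if isodd(x)
        sync_warp(0x55555555)  # lanes 0, 2, 4, … (odd threadIdx values)
    else
        sync_warp(0xaaaaaaaa)  # lanes 1, 3, 5, … (even threadIdx values)
    end
    return
end
```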
Are there any other cases where union splitting can kick in due to some aggressive optimizations, even without function-call boundaries?
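A practical way to check whether a particular kernel got union-split is to look at the generated code; CUDA.jl’s reflection macros make this reasonably painless (a sketch; you still have to scan the output for branches on the type tag):

```julia
using CUDA

# Compile without launching and dump typed / LLVM IR for inspection:
CUDA.@device_code_typed @cuda launch=false kernel()
CUDA.@device_code_llvm debuginfo=:none @cuda launch=false kernel()
```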
Does anybody know why `__syncthreads` is execution-mask-aware? I find it strange; does it just have to be this way due to some GPU hardware limitation?