I’ve been playing with `Union`-typed objects on CUDA GPUs and found an interesting quirk in how union splitting interacts with thread synchronization. I thought I’d share it here since it’s not really a CUDA.jl issue.

When `CUDA.sync_threads` is used inside a function that may receive arguments of different types across threads, we get a deadlock:
```julia
using CUDA

function kernel()
    x = threadIdx().x
    if isodd(x)
        x = Float32(x)
    else
        x = Int32(x)
    end
    # Using foldl for, e.g., loop unrolling:
    x = Base.afoldl(x, 1, 2) do x, i
        @cushow x
        x += i
        sync_threads()
        x
    end
    return
end
```
We get a deadlock with multiple threads:
```
julia> CUDA.@sync @cuda threads=1 kernel();
x = 1.000000
x = 2.000000

julia> CUDA.@sync @cuda threads=2 kernel();
x = 1.000000
x = 2
^CERROR: InterruptException:
Stacktrace:
  process_events
    @ ./libuv.jl:104 [inlined]
  wait
    @ ./task.jl:765 [inlined]
  yield()
    @ Base ./task.jl:657
  synchronize(s::CuStream; blocking::Bool)
    @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/stream.jl:119
  top-level scope
    @ ~/.julia/dev/CUDA/src/utilities.jl:29

julia> exit()
^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT error in running finalizer: InterruptException()
unknown function (ip: 0x7f4d10f312b7)
unknown function (ip: 0x7f4d10db2dfe)
unknown function (ip: 0x7f4d10d98444)
unknown function (ip: 0x7f4d1100aad5)
unknown function (ip: 0x7f4d1100b5c0)
unknown function (ip: 0x7f4d10e1c1a6)
unknown function (ip: 0x7f4d10d987c6)
unknown function (ip: 0x7f4d10d73717)
cuModuleUnload at /usr/lib/x86_64-linux-gnu/libcuda.so.1 (unknown line)
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:272 [inlined]
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/error.jl:94 [inlined]
cuModuleUnload at /home/tkf/.julia/dev/CUDA/lib/utils/call.jl:26 [inlined]
macro expansion at /home/tkf/.julia/dev/CUDA/lib/cudadrv/context.jl:162 [inlined]
unsafe_unload! at /home/tkf/.julia/dev/CUDA/lib/cudadrv/module.jl:78
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
run_finalizer at /buildworker/worker/package_linux64/build/src/gc.c:278
jl_gc_run_finalizers_in_list at /buildworker/worker/package_linux64/build/src/gc.c:365
run_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:394 [inlined]
jl_gc_run_all_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:480
jl_atexit_hook at /buildworker/worker/package_linux64/build/src/init.c:240
repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:703
main at julia1.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4007d8)
```
As the CUDA programming guide notes:

> `__syncthreads()` is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
So, since a function call on a `Union`-typed variable is actually a conditional (dispatch compiles to a branch on the type tag), the example above invokes this undefined behavior.
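To make this concrete, here is a hand-written version of roughly what union splitting produces for the kernel above (a sketch of the control flow, not the actual lowered code; `split_kernel` is a name I made up). Each specialization carries its own `sync_threads()` call site, so odd and even threads wait at different barriers:

```julia
using CUDA

function split_kernel()
    x = threadIdx().x
    if isodd(x)            # the branch union splitting introduces
        y = Float32(x)
        @cushow y
        sync_threads()     # reached only by odd threads
    else
        y = Int32(x)
        @cushow y
        sync_threads()     # reached only by even threads
    end
    return
end
```

With `threads=2`, thread 1 blocks at the first barrier and thread 2 at the second, and neither can make progress: exactly the “conditional that does not evaluate identically across the thread block” case.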
Some random comments:
In general, I guess we should be very careful when using `sync_threads` inside a function, including `do` blocks. (Or just avoid it.)
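One defensive pattern (just a sketch; `safe_kernel` is a hypothetical name) is to convert to a concrete type before entering any code that contains a barrier, so every thread runs the same specialization:

```julia
using CUDA

function safe_kernel()
    x = threadIdx().x
    y = isodd(x) ? Float32(x) : Int32(x)  # still Union{Float32,Int32}
    z = Float32(y)                        # concrete again: no union split below
    z = Base.afoldl(z, 1, 2) do z, i
        @cushow z
        z += i
        sync_threads()                    # same call site for every thread
        z
    end
    return
end
```

Of course this only works when a common concrete type exists; otherwise the barrier has to move out of the type-unstable region entirely.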
`sync_warp` does not have this problem.
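For contrast, `sync_warp` takes an explicit participation mask, so divergent threads can still synchronize within their active subset. A sketch, assuming a single full warp of 32 threads (note odd `threadIdx` values sit on even lanes because Julia indexing is 1-based):

```julia
using CUDA

function warp_kernel()
    x = threadIdx().x
    if isodd(x)
        sync_warp(0x55555555)  # lanes 0, 2, 4, … (odd threadIdx values)
    else
        sync_warp(0xaaaaaaaa)  # lanes 1, 3, 5, … (even threadIdx values)
    end
    return
end
```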
Are there any other cases where union splitting can kick in due to some aggressive optimizations, even without function-call boundaries?
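A practical way to check whether a particular kernel got union-split is to look at the generated code; CUDA.jl’s reflection macros make this reasonably painless (a sketch; you still have to scan the output for branches on the type tag):

```julia
using CUDA

# Compile without launching and dump typed / LLVM IR for inspection:
CUDA.@device_code_typed @cuda launch=false kernel()
CUDA.@device_code_llvm debuginfo=:none @cuda launch=false kernel()
```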
Does anybody know why `__syncthreads` is execution-mask-aware? I find it strange; does it just have to be this way due to some GPU hardware limitation?