Hi,
While experimenting a bit with cooperative groups, and coalesced groups in particular, I've encountered some strange behaviour that I cannot explain. As a simple example, consider
using CUDA
using CUDA: i32
function my_sum(out, mini_data)
    # length(mini_data) < 32: fits in a single warp, but not all threads are active
    tidx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x  # (or just threadIdx().x)
    tidx > length(mini_data) && return
    val = mini_data[tidx]
    cg = CG.coalesced_threads()
    if tidx == 1
        @cuprintln("num_threads in coalesced group: ", CG.num_threads(cg), "\n")
    end
    stride = 1i32
    # (Because we're using local memory, the memory order corresponding to increasing strides should be fine)
    while stride < CG.num_threads(cg)
        higher_cg_lane = CG.thread_rank(cg) + stride
        @cuprintln(CG.thread_rank(cg), " (", val, ")", " takes from ", higher_cg_lane, " (if inbounds)")
        # CG.sync(cg)  # Synchronisation necessary? Causes a deadlock... (?)
        v = if higher_cg_lane <= CG.num_threads(cg)
            CG.shfl(cg, val, higher_cg_lane)  # (or CG.shfl_down(cg, val, stride))
        else
            0.0f0
        end
        # CG.sync(cg)
        val += v
        stride *= 2
        if CG.thread_rank(cg) == 1
            @cuprintln()
        end
    end
    if CG.thread_rank(cg) == 1
        out[1] = val
    end
    return
end
mini_data = cu(Float32.(1:8))
out = CUDA.zeros(Float32, 1)
CUDA.@sync @cuda threads=32 blocks=1 my_sum(out, mini_data)
CUDA.@allowscalar println("Expected: $(sum(mini_data)),\t Got: $(out[1])")
where we just calculate sum(1:8) == 36 with a standard reduction, but using a coalesced group. However, the output
num_threads in coalesced group: 8
1 (1.000000) takes from 2 (if inbounds)
2 (2.000000) takes from 3 (if inbounds)
3 (3.000000) takes from 4 (if inbounds)
4 (4.000000) takes from 5 (if inbounds)
5 (5.000000) takes from 6 (if inbounds)
6 (6.000000) takes from 7 (if inbounds)
7 (7.000000) takes from 8 (if inbounds)
8 (8.000000) takes from 9 (if inbounds)
8 (8.000000) takes from 10 (if inbounds)
8 (8.000000) takes from 12 (if inbounds)
1 (3.000000) takes from 3 (if inbounds)
2 (5.000000) takes from 4 (if inbounds)
3 (7.000000) takes from 5 (if inbounds)
4 (9.000000) takes from 6 (if inbounds)
5 (11.000000) takes from 7 (if inbounds)
6 (13.000000) takes from 8 (if inbounds)
7 (7.000000) takes from 9 (if inbounds)
7 (7.000000) takes from 11 (if inbounds)
1 (10.000000) takes from 5 (if inbounds)
2 (14.000000) takes from 6 (if inbounds)
3 (18.000000) takes from 7 (if inbounds)
4 (22.000000) takes from 8 (if inbounds)
5 (11.000000) takes from 9 (if inbounds)
6 (13.000000) takes from 10 (if inbounds)
Expected: 36.0, Got: 10.0
is very strange. First of all, the returned sum is obviously incorrect. Looking at threads 1-6 in the first iteration step, everything is fine. But thread 7 somehow fails to get the CG.shfl val from thread 8. Also, thread 8 appears three times, presumably once per stride step. The latter looks like a synchronisation issue, but adding CG.sync(cg) causes a deadlock. In any case, the issues at the higher lanes then propagate downwards in further iterations.
- Can anyone explain what is happening here, and how to fix it?
- In general, should one explicitly synchronise around CG.shfl, or is this (despite the name) more akin to CUDA.shfl_sync?
- The implementation of reduce_sum_tile_shfl in NVIDIA's blog post on cooperative groups does not synchronise. But I guess they assume tile_sz is a power of two, in which case there is no need for a conditional, so that everything automatically runs in sync? (For reference, a sketch of such a non-divergent variant follows below.)
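Here is a sketch of the non-divergent variant I would naively write following that pattern: every member of the coalesced group calls CG.shfl_down unconditionally, and only the accumulation is guarded, so the group should never diverge inside the loop. It reuses the same using CUDA / using CUDA: i32 and out, mini_data as above; the kernel name my_sum_shfl_down is just mine, I am assuming that a shuffled value whose source rank lies outside the group can simply be ignored, and I have not verified that this is the intended use of the API.

function my_sum_shfl_down(out, mini_data)
    tidx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    tidx > length(mini_data) && return
    val = mini_data[tidx]
    cg = CG.coalesced_threads()
    stride = 1i32
    while stride < CG.num_threads(cg)
        # Every member of the coalesced group takes part in the collective,
        # so there is no divergence within the group at this point.
        v = CG.shfl_down(cg, val, stride)
        # Only accumulate when the source rank actually exists in the group;
        # otherwise the shuffled value is ignored.
        if CG.thread_rank(cg) + stride <= CG.num_threads(cg)
            val += v
        end
        stride *= 2i32
    end
    if CG.thread_rank(cg) == 1
        out[1] = val
    end
    return
end

CUDA.@sync @cuda threads=32 blocks=1 my_sum_shfl_down(out, mini_data)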
versioninfo
julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 566.14.0
CUDA libraries:
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+566.14
Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0
Toolchain:
- Julia: 1.11.2
- LLVM: 16.0.6
1 device:
0: NVIDIA GeForce RTX 3070 (sm_86, 5.853 GiB / 8.000 GiB available)