Unexpected coalesced group behaviour in CUDA.jl

Hi,

While experimenting a bit with cooperative groups, in particular coalesced groups, I've encountered some strange behaviour that I cannot explain. As a simple example, consider

using CUDA
using CUDA: i32
using CUDA: CG  # cooperative groups submodule

function my_sum(out, mini_data)
	# length(mini_data) < 32: fits in a single warp, but not all threads are active
	tidx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x  # (or just threadIdx().x)
	tidx > length(mini_data) && return
	
	val = mini_data[tidx]
	
	cg = CG.coalesced_threads()
	if tidx == 1
		@cuprintln("num_threads in coalesced group: ", CG.num_threads(cg), "\n")
	end
	
	stride = 1i32
	# (Because we're using local memory, the memory order corresponding to increasing strides should be fine)
	while stride < CG.num_threads(cg)
		higher_cg_lane = CG.thread_rank(cg) + stride
		@cuprintln(CG.thread_rank(cg), " (", val, ")", " takes from ", higher_cg_lane, " (if inbounds)")
		
		# CG.sync(cg)    # Synchronisation necessary? Causes a deadlock... (?)
		v = if higher_cg_lane <= CG.num_threads(cg)
			CG.shfl(cg, val, higher_cg_lane)  # (or CG.shfl_down(cg, val, stride) )
		else
			0.0f0
		end
		# CG.sync(cg)
		val += v
		
		stride *= 2
		if CG.thread_rank(cg) == 1
			@cuprintln()
		end
	end
	if CG.thread_rank(cg) == 1
		out[1] = val
	end
	return
end

mini_data = cu(Float32.(1:8))
out = CUDA.zeros(Float32, 1)
CUDA.@sync @cuda threads=32 blocks=1 my_sum(out, mini_data)
CUDA.@allowscalar println("Expected: $(sum(mini_data)),\t Got: $(out[1])")

where we just calculate sum(1:8) == 36 with a standard reduction, but using a coalesced group. However, the output

num_threads in coalesced group: 8

1 (1.000000) takes from 2 (if inbounds)
2 (2.000000) takes from 3 (if inbounds)
3 (3.000000) takes from 4 (if inbounds)
4 (4.000000) takes from 5 (if inbounds)
5 (5.000000) takes from 6 (if inbounds)
6 (6.000000) takes from 7 (if inbounds)
7 (7.000000) takes from 8 (if inbounds)
8 (8.000000) takes from 9 (if inbounds)
8 (8.000000) takes from 10 (if inbounds)
8 (8.000000) takes from 12 (if inbounds)

1 (3.000000) takes from 3 (if inbounds)
2 (5.000000) takes from 4 (if inbounds)
3 (7.000000) takes from 5 (if inbounds)
4 (9.000000) takes from 6 (if inbounds)
5 (11.000000) takes from 7 (if inbounds)
6 (13.000000) takes from 8 (if inbounds)
7 (7.000000) takes from 9 (if inbounds)
7 (7.000000) takes from 11 (if inbounds)

1 (10.000000) takes from 5 (if inbounds)
2 (14.000000) takes from 6 (if inbounds)
3 (18.000000) takes from 7 (if inbounds)
4 (22.000000) takes from 8 (if inbounds)
5 (11.000000) takes from 9 (if inbounds)
6 (13.000000) takes from 10 (if inbounds)

Expected: 36.0,  Got: 10.0

is very strange. First of all, the returned sum is obviously incorrect. Looking at threads 1-6 in the first iteration, everything is fine, but thread 7 somehow fails to get the CG.shfl value from thread 8. Also, thread 8 appears three times, presumably once per stride step. The latter looks like a synchronisation issue, yet adding CG.sync(cg) causes a deadlock. In any case, the issues at the higher lanes then propagate downwards in further iterations.

  • Can anyone explain what is happening here, and how to fix it?
  • In general, should one explicitly synchronise around CG.shfl, or is this (despite the name) more akin to CUDA.shfl_sync?
    • The implementation of reduce_sum_tile_shfl in NVIDIA's blog post on cooperative groups does not synchronise. But I guess they assume tile_sz is a power of two, in which case there is no need for a conditional, so that everything automatically runs in sync? (See the sketch right after this list.)
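For reference, that pattern would look roughly like this in Julia (my own sketch, not code from the blog post, and the helper name is made up; it assumes it is called from inside a kernel in which all 32 lanes of the warp are active, so the coalesced group has a power-of-two size):

# Hypothetical device-side helper: every rank executes the same CG.shfl_down on
# every iteration, so there is no divergent branch and nothing extra to synchronise.
function tile_sum_sketch(val::Float32)
	cg = CG.coalesced_threads()
	offset = CG.num_threads(cg) >> 1  # power-of-two group size assumed
	while offset > 0
		val += CG.shfl_down(cg, val, offset)  # all threads participate in every step
		offset >>= 1
	end
	return val  # rank 1 ends up holding the full sum
end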
versioninfo
julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 566.14.0

CUDA libraries:
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+566.14

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.2
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 5.853 GiB / 8.000 GiB available)

That’s because thread 8 doesn’t participate in the shfl; for thread 8 higher_cg_lane is 9 so you don’t call CG.shfl.

You're maybe better off keeping all threads participating and doing something like in the NVIDIA Technical Blog post Cooperative Groups: Flexible CUDA Thread Programming?
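Something along these lines, for example (an untested sketch; it assumes a single block of 32 threads and at most 32 elements, and the kernel name is made up):

function my_sum_all_threads(out, data)
	tidx = threadIdx().x
	# no early return: out-of-range threads contribute the neutral element instead,
	# so the coalesced group below is the full warp, i.e. a power-of-two size
	val = tidx <= length(data) ? data[tidx] : 0.0f0
	cg = CG.coalesced_threads()
	offset = CG.num_threads(cg) >> 1
	while offset > 0
		val += CG.shfl_down(cg, val, offset)
		offset >>= 1
	end
	if CG.thread_rank(cg) == 1
		out[1] = val
	end
	return
end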

Thanks! I guess I misread the documentation

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.

to mean you could also read from lanes not participating in the shfl, but you’d get garbage values (and so you probably shouldn’t, cf. first sentence).

Yeah, always performing the shuffle and filtering on lane afterwards indeed works. (I still need this filtering because, in contrast to the blog post example, I'm not assuming num_threads (tile_sz) is a power of two; it is then also a bit easier to work with increasing strides.)

Concretely, changing

v = if higher_cg_lane <= CG.num_threads(cg)
    CG.shfl(cg, val, higher_cg_lane)  # (or CG.shfl_down(cg, val, stride) )
else
    0.0f0
end
val += v

to

v = CG.shfl_down(cg, val, stride)  # NOT CG.shfl(cg, val, higher_cg_lane)
if higher_cg_lane <= CG.num_threads(cg)
    val += v
end

fixes the issues.
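For reference, the whole kernel with this change (and with the debug printing stripped) then reads:

function my_sum(out, mini_data)
	tidx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
	tidx > length(mini_data) && return
	val = mini_data[tidx]
	cg = CG.coalesced_threads()
	stride = 1i32
	while stride < CG.num_threads(cg)
		higher_cg_lane = CG.thread_rank(cg) + stride
		v = CG.shfl_down(cg, val, stride)  # every thread in the group participates
		if higher_cg_lane <= CG.num_threads(cg)
			val += v  # only keep contributions from lanes that actually exist
		end
		stride *= 2
	end
	if CG.thread_rank(cg) == 1
		out[1] = val
	end
	return
end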

I’m not sure why, but CG.shfl(cg, val, higher_cg_lane) causes an InexactError on thread 8, i.e. presumably when higher_cg_lane does not correspond to a participating thread. Maybe the difference is somehow related to __shfl_sync(mask, var, srcLane[, width])'s

If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection)

compared to

__shfl_down_sync(mask, var, delta[, width])'s

the ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged.

(which is the only difference I can see between CG.shfl(cg, val, CG.thread_rank(cg) + stride) and CG.shfl_down(cg, val, stride))?

Seeing as the corrected code has some branching but no synchronisation issues (and this remains so even if I synthetically add tons of extra work only for the thread with CG.thread_rank(cg) == 1), I guess the answer is that you don't have to synchronise explicitly.

Fixed output
num_threads in coalesced group: 8

1 (1.000000) takes from 2 (if inbounds)
2 (2.000000) takes from 3 (if inbounds)
3 (3.000000) takes from 4 (if inbounds)
4 (4.000000) takes from 5 (if inbounds)
5 (5.000000) takes from 6 (if inbounds)
6 (6.000000) takes from 7 (if inbounds)
7 (7.000000) takes from 8 (if inbounds)
8 (8.000000) takes from 9 (if inbounds)

1 (3.000000) takes from 3 (if inbounds)
2 (5.000000) takes from 4 (if inbounds)
3 (7.000000) takes from 5 (if inbounds)
4 (9.000000) takes from 6 (if inbounds)
5 (11.000000) takes from 7 (if inbounds)
6 (13.000000) takes from 8 (if inbounds)
7 (15.000000) takes from 9 (if inbounds)
8 (8.000000) takes from 10 (if inbounds)

1 (10.000000) takes from 5 (if inbounds)
2 (14.000000) takes from 6 (if inbounds)
3 (18.000000) takes from 7 (if inbounds)
4 (22.000000) takes from 8 (if inbounds)
5 (26.000000) takes from 9 (if inbounds)
6 (21.000000) takes from 10 (if inbounds)
7 (15.000000) takes from 11 (if inbounds)
8 (8.000000) takes from 12 (if inbounds)

Expected: 36.0,  Got: 36.0

Yeah, the ffs(mask of 8 threads, 9th bit) breaks things because it returns -1. I could probably fix that, but I think it's also undefined behavior here to call shfl on a non-participating thread.
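To illustrate on the host side (a hypothetical helper, not the actual device code): the shuffle roughly has to map the group rank to a physical lane by locating the rank-th set bit of the group's mask, and an 8-thread mask simply has no 9th set bit, so the lookup comes back as -1, which is presumably what ends up triggering the InexactError.

# hypothetical stand-in for the ffs/fns-style rank-to-lane lookup
function nth_set_bit(mask::UInt32, n::Integer)
	count = 0
	for lane in 0:31
		if (mask >> lane) & 0x1 == 0x1
			count += 1
			count == n && return lane
		end
	end
	return -1  # "not found"
end

mask = UInt32(0xff)      # a coalesced group of 8 threads
nth_set_bit(mask, 8)     # 7: a valid lane
nth_set_bit(mask, 9)     # -1, which then fails to convert to an unsigned lane index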
