Sequence of warp and how to avoid divergence when folding shared memory in a reduction kernel

cuda

#1

Hi everyone,

I jumped into CUDA programming mid-way without good background knowledge. It is just because CUDAnative.jl is so easy to program with. Now I have a problem related to CUDA, not not the package CUDAnative.jl. But I believe that there will be experts here to answer my question!

My question came from the effort in 2d numerical integral. The following function is the final step to do the summation of BF[X,w1,w2]*dω1dω2[w1,w2] on the grid. The equivalent CPU code is

OUT[x] = sum( BF[x,:,:] .* dω1dω2[:,:] ) for x in X

Since the mesh for w1,w2 is large, I expect acceleration with GPU.

The following code consists of two parts. The first part has a double-while loop, which accumulates on the shared memory sdata. The second part is a familiar one, which “fold” the shared memory into the final result. Similar code for a 1d shared memory is a standard CUDA sample code.

My question is: what is the warp execution order if I use threadIdx().x and threadIdx().y ? Should I reverse the order of “folding”, by firstly do it on the y-dimension? The answer is crucial to the performance of the second part. If I got the wrong understanding, the second part will probably be “divergent” – different threads in a warp are executing different if-else branches. The warp size is 32.


#NOTE goal:  sum( BF[X,:,:] .* dω1dω2[:,:] ) ==> OUT[X,b3] for each X
function Trapezoid_kernel!(
	dimsBF::NTuple{5,Int64},
	OUT,							#signature (:X, :ω)
	BF,								#signature (:X, :ω1-1, :ω2-1)
	dω1dω2							#signature (:ω1-1, :ω2-1)
	)
	# -------------------------------------------------
	(dimBF1,dimBF2,dimBF3) = dimsBF
	# -------------------------------------------------
	sdata = @cuStaticSharedMem(ComplexF64,(32,32)) #XXX could be larger!
	sdata[threadIdx().x,threadIdx().y] = zero(ComplexF64)
	# -------------------------------------------------
	if blockIdx().z<=dimBF1 && blockIdx().x==1 && blockIdx().y==1
		tidx = threadIdx().x
		while tidx <= dimBF2
			tidy = threadIdx().y
			while tidy <= dimBF3
				#NOTE accumulate on shared memory
				#NOTE BF[X,:,:] .* dω1dω2[:,:] where X = blockIdx().z
				sdata[threadIdx().x,threadIdx().y] += (BF[blockIdx().z,tidx,tidy]*dω1dω2[tidx,tidy])
				tidy += 32
			end
			tidx += 32
		end
		# ---------------------- REDUCE -------------------------
		sync_threads()
		# ------------------------------------------------------
		#QUESTION the warp sequence? how to avoid digergence ?
		if threadIdx().x<=16
			@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+16,threadIdx().y]
		end
		sync_threads()
		if threadIdx().x<=8
			@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+8,threadIdx().y]
		end
		sync_threads()
		if threadIdx().x<=4
			@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+4,threadIdx().y]
		end
		sync_threads()
		if threadIdx().x<=2
			@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+2,threadIdx().y]
		end
		sync_threads()
		if threadIdx().x==1
			@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+1,threadIdx().y]
		end
		sync_threads()
		if threadIdx().x==1 && threadIdx().y<=16
			@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+16]
	    	sync_threads()
			@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+8]
	    	sync_threads()
			@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+4]
	    	sync_threads()
			@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+2]
	    	sync_threads()
			@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+1]
	    	sync_threads()
		end
		sync_threads()
		# ---------------------- WRITE -------------------------
		if threadIdx().x==1 && threadIdx().y==1
            @inbounds OUT[blockIdx().z] = sdata[1,1]
        end
	end
	nothing
end



#2

I just need some keywords and/or some links :smiley:


#3

I could have use linear indexing for the w1,w2 mesh, but this would require additional multiplication/division operations of the order (size(dω1dω2,1)/32)*(size(dω1dω2,2)/32) in the double-while loop.


#4

From the CUDA programming guide:

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy),the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

Note that depending on your hardware, divergence might not be all that terrible (Independent Thread Scheduling on Volta), so be sure to profile.