CUDA global synchronization HOWTO

Marcell_Havlik · January 20, 2022, 9:57am

Dear Julianners,
I try to create an algorithm that runs an elementwise update operation and a reduction in 10k iteration and about 1_000_000 times, so the kernel restarts(2-8us) are really expensive in this scenario.
The algorithm is very simple but on GPU I need to sync all the calculations before the reduce_sum.

I simplified the whole algorithm to this:

using BenchmarkTools
using CUDA
cudatestSync(c, a, b, X) = begin
	I = ((blockIdx().x - 1) * blockDim().x + threadIdx().x)
	I > size(c,1) * size(c,2) && return
	ca = CUDA.Const(a)
	cb = CUDA.Const(b)
	for i in 1:20
		c[I] += ca[I] + cb[I] 
	end
	# sync_threads()
	# CUDA.device_synchronize()
	# sync_threads()
	# device_synchronize()
	# while syncCounter<size(c,1) * size(c,2)
	# 	if ok[I]==0
	# 		atomic_inc!(syncCounter);
	# 		ok[I]+=1
	# 	end
	# end
	# nanosleep(1000_000_000)
	# threadfence()
	# threadfence_block()
	# synchronize()
	gh = this_grid()
	sync_grid(gh)
	if I <= size(c,1) 
		for j in 1:size(c,2)
			X[I] += c[I,j]
		end
	end
	return
end
nth=512	
N = 1000
cf,af,bf=CUDA.randn(Float32, N, N),CUDA.randn(Float32, N, N),CUDA.randn(Float32, N, N)
C,A,B = copy(cf),copy(af),copy(bf); X=CUDA.zeros(Float32,N)
@time CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  cooperative=true cudatestSync(C,A,B,X))
@show sum(cf .+ 20 .* (af .+ bf) ; dims=2)[1:10]
@show X[1:10]
@assert all(sum(cf .+ 20 .* (af .+ bf) ; dims=2) .≈ X) "sync doesn't WORK!"

I tried many different method that is supportted by the CUDA.jl. I know lots of them is just stupid try, I just let it there as I just tried them out.
I believed cooperative groups would be the way to go:
https://developer.nvidia.com/blog/cooperative-groups/
cuda-robus-scalable.pdf)
But I get
ERROR: LoadError: CUDA error: too many blocks in cooperative launch (code 720, ERROR_COOPERATIVE_LAUNCH_TOO_LARGE)

So the goal is to reach global synchronization before the reduce_sum. How can I do it without exiting a kernel?

maleadt · January 20, 2022, 10:00am

Global synchronization, across SMs, is just not what CUDA is meant for. Cooperative groups can hide that, but it’ll impact performance and the possible launch configuration (as seen here). I recommend you try to fuse more so that you can reduce the number of launches without requiring a sync. If you really need a global sync, only launch a single block so that syncthreads is global.

Marcell_Havlik · January 20, 2022, 11:01am

I fused 3 kernel call into one, the only last one remained the reduce_sum, but If I could fuse this I could reduce the kernel calls by 2x 10_000 times as I could move the whole cycle into the kernel.

For me it is extremly big deal and important to be able to fuse this. This algorithm could be so ridiculous fast with that, I cannot even imagine. Now it needs like 25 hours to run.
I have to do it only once. Any possible solution would be great!

Marcell_Havlik · January 20, 2022, 11:44am

The way they did it sounds extremely elegant:

Is this possible?

For me I cannot make atomic_inc!(block_num) work due to:

KernelError: kernel returns a value of type `Union{}`

Make sure your kernel function ends in `return`, `return nothing` or `nothing`.
If the returned value is of type `Union{}`, your Julia code probably throws an exception.
Inspect the code with `@device_code_warntype` for more details.

What can be the problem?

maleadt · January 20, 2022, 11:46am

You’re invoking this method incorrectly. Please read the documentation: Troubleshooting · CUDA.jl. Look at the tests for valid invocations of atomic_inc!.

Marcell_Havlik · January 20, 2022, 12:07pm

Ok, I was calling a wrong function all the time. XD Great… atomic_inc! works as described. Btw… the way it works is mindbogling. I was like it will be like a ++i

Totally excited about this method they described. :o

maleadt · January 20, 2022, 12:12pm

That only works for your particular kernel/set-up/inputs. In general it is impossible, because you can have a configuration where not all blocks can be scheduled at the same time, due to resource constraints. If you then come up with some sort of operation that requires all threads to reach that point, you’ll just deadlock.

Marcell_Havlik · January 20, 2022, 12:48pm

Yeah… that is a very serious problem.
I should ask “the thread to stop at this point and work on another thread’s job and reschedule when condition succeed”.

maleadt · January 20, 2022, 12:50pm

That’s not possible with the current programming model. Hence; “you shouldn’t be doing this”. The thread you linked to has several other people proclaiming the same thing. If it works for your specific problem at hand, that’s great, but don’t expect it to generalize to other problem sizes / hardware / Julia versions / data types / … without carefully verifying it does.

Marcell_Havlik · January 20, 2022, 12:52pm

No, you should be right in my scenario too. The dataset is bigger than the ‘parallel lanes’ or how to say.

Shit I misscalculated this very much. I was calculating with finishing “threads”. But this should make a deadlock.

This is VERY sad. I really wanted to spare the kernel starts.

Topic		Replies	Views
Global memory behavior explained GPU	9	1180	October 13, 2020
GPU Synchronization Issue - using KernelAbstraction GPU question	5	457	December 13, 2023
Most efficient way of _waiting_ for GPU results? GPU	20	3102	January 31, 2019
Notes on `CUDA.sync_threads` and dispatch on `Union` GPU gpu	3	1038	April 16, 2021
Synchronizing Cuda kernels GPU	5	2475	September 20, 2019

CUDA global synchronization HOWTO

Related topics