CUDA global synchronization HOWTO

Dear Julianners,
I'm trying to create an algorithm that runs an element-wise update operation and a reduction for 10k iterations, about 1,000,000 times in total, so the kernel restarts (2–8 µs each) are really expensive in this scenario.
The algorithm is very simple but on GPU I need to sync all the calculations before the reduce_sum.

I simplified the whole algorithm to this:

using BenchmarkTools
using CUDA

cudatestSync(c, a, b, X) = begin
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	ca = CUDA.Const(a)
	cb = CUDA.Const(b)
	if I <= size(c, 1) * size(c, 2)  # bounds check; no early return, every thread must reach sync_grid
		for i in 1:20
			c[I] += ca[I] + cb[I]
		end
	end
	# Things I tried for a global sync (none of them worked):
	# sync_threads()              # only syncs threads within one block
	# CUDA.device_synchronize()   # host-side API, not callable in a kernel
	# while syncCounter < size(c, 1) * size(c, 2)  # hand-rolled barrier
	# 	if ok[I] == 0
	# 		atomic_inc!(syncCounter)
	# 		ok[I] += 1
	# 	end
	# end
	# nanosleep(1_000_000_000)
	# threadfence()
	# threadfence_block()
	gh = this_grid()
	sync_grid(gh)  # grid-wide barrier; requires cooperative=true at launch
	if I <= size(c, 1)
		for j in 1:size(c, 2)
			X[I] += c[I, j]
		end
	end
	return
end

N = 1000
nth = 256
cf, af, bf = CUDA.randn(Float32, N, N), CUDA.randn(Float32, N, N), CUDA.randn(Float32, N, N)
C, A, B = copy(cf), copy(af), copy(bf); X = CUDA.zeros(Float32, N)
@time CUDA.@sync @cuda threads=nth blocks=cld(N * N, nth) cooperative=true cudatestSync(C, A, B, X)
@show sum(cf .+ 20 .* (af .+ bf); dims=2)[1:10]
@show X[1:10]
@assert all(sum(cf .+ 20 .* (af .+ bf); dims=2) .≈ X) "sync doesn't WORK!"

I tried many of the different methods supported by CUDA.jl. I know a lot of them were just naive attempts; I left them in as comments as I tried things out.
I believed cooperative groups would be the way to go, but I get:
ERROR: LoadError: CUDA error: too many blocks in cooperative launch (code 720, ERROR_COOPERATIVE_LAUNCH_TOO_LARGE)

So the goal is to reach global synchronization before the reduce_sum. How can I do it without exiting a kernel?

Global synchronization, across SMs, is just not what CUDA is meant for. Cooperative groups can hide that, but it’ll impact performance and the possible launch configuration (as seen here). I recommend you try to fuse more so that you can reduce the number of launches without requiring a sync. If you really need a global sync, only launch a single block so that syncthreads is global.
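To illustrate the single-block suggestion: a minimal sketch (the kernel name and stride pattern are mine, not from the original post) where `blocks=1` makes `sync_threads()` act as a global barrier, so the update loop and the reduction can live in one kernel, with each thread striding over the whole array:

```julia
using CUDA

# Single-block version: with blocks=1, sync_threads() is effectively a
# global barrier, so update and reduction fit in one kernel launch.
function update_and_reduce!(c, a, b, X, iters)
    I = threadIdx().x
    stride = blockDim().x
    n = length(c)
    for _ in 1:iters
        # element-wise update: each thread strides over the whole array
        i = I
        while i <= n
            c[i] += a[i] + b[i]
            i += stride
        end
        sync_threads()  # every thread of the (single) block waits here
        # row-wise reduction into X
        r = I
        while r <= size(c, 1)
            s = 0.0f0
            for j in 1:size(c, 2)
                s += c[r, j]
            end
            X[r] = s
            r += stride
        end
        sync_threads()
    end
    return
end

# @cuda threads=1024 blocks=1 update_and_reduce!(C, A, B, X, 20)
```

The trade-off is that a single block uses only one SM, so this is only a win when the launch overhead dominates the actual work.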

I fused 3 kernel calls into one; only the reduce_sum remained separate. If I could fuse that too, I could cut the kernel calls by 2 × 10,000, since I could move the whole loop into the kernel.

For me, being able to fuse this is an extremely big deal. The algorithm could be ridiculously fast with it; right now it needs about 25 hours to run.
I only have to do it once. Any possible solution would be great!

The way they did it sounds extremely elegant:

Is this possible?

I cannot make atomic_inc!(block_num) work; it fails with:

KernelError: kernel returns a value of type `Union{}`

Make sure your kernel function ends in `return`, `return nothing` or `nothing`.
If the returned value is of type `Union{}`, your Julia code probably throws an exception.
Inspect the code with `@device_code_warntype` for more details.

What can be the problem?

You’re invoking this method incorrectly. Please read the documentation: Troubleshooting · CUDA.jl. Look at the tests for valid invocations of atomic_inc!.


Ok, I was calling a wrong function all the time. XD Great… atomic_inc! works as described. Btw, the way it works is mind-boggling. :smiley: I expected it to behave like ++i.
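For anyone else surprised by this: as I understand it, `atomic_inc!` follows the semantics of CUDA's `atomicInc`, which wraps the counter back to 0 once the old value reaches the given limit, rather than incrementing unconditionally like `++i`. A CPU-side model of that semantics (a sketch, not CUDA.jl's actual implementation):

```julia
# CPU model of CUDA's atomicInc semantics, which CUDA.atomic_inc! maps to:
# the counter wraps to 0 when the old value reaches `limit`,
# instead of growing without bound.
function atomic_inc_model(old, limit)
    return old >= limit ? 0 : old + 1
end

atomic_inc_model(5, 10)   # 6
atomic_inc_model(10, 10)  # 0 (wrapped)
```

This wrap-around is what makes it useful for cycling block indices, but it bites you if you expect a plain increment.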

Totally excited about this method they described. :o

That only works for your particular kernel/set-up/inputs. In general it is impossible, because you can have a configuration where not all blocks can be scheduled at the same time, due to resource constraints. If you then come up with some sort of operation that requires all threads to reach that point, you’ll just deadlock.
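To make the resource constraint concrete: a cooperative launch is only valid if every block can be resident on the GPU at the same time. A sketch of checking this up front with CUDA.jl's occupancy API (`launch_configuration`; the variable names follow the example above, and the exact fields are worth verifying against the CUDA.jl docs):

```julia
using CUDA

# Compile without launching, then ask the occupancy API how many blocks
# can actually be resident at once.
kernel = @cuda launch=false cudatestSync(C, A, B, X)
config = launch_configuration(kernel.fun)  # suggested (threads, blocks)
nth = config.threads
if cld(N * N, nth) <= config.blocks
    # every block fits concurrently, so a cooperative launch is legal
    kernel(C, A, B, X; threads=nth, blocks=cld(N * N, nth), cooperative=true)
else
    @warn "too many blocks for a cooperative launch; restructure the kernel with a grid-stride loop"
end
```

The usual fix when the problem exceeds the resident-block limit is a grid-stride loop, so the block count can be capped at `config.blocks` while each thread processes multiple elements.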

Yeah… that is a very serious problem.
What I'd need is to tell a thread "stop at this point, work on another thread's job, and get rescheduled once the condition succeeds". :frowning:

That’s not possible with the current programming model. Hence; “you shouldn’t be doing this”. The thread you linked to has several other people proclaiming the same thing. If it works for your specific problem at hand, that’s great, but don’t expect it to generalize to other problem sizes / hardware / Julia versions / data types / … without carefully verifying it does.

No, you should be right in my scenario too. The dataset is bigger than the number of "parallel lanes", or whatever the right term is.

Shit, I badly miscalculated this. I was reasoning as if threads would finish and free up their slots. But this would just deadlock.

This is VERY sad. I really wanted to save on the kernel launches. :frowning: