Dear Julianners,

I'm trying to create an algorithm that runs an elementwise update operation followed by a reduction for 10k iterations, about 1_000_000 times in total, so kernel restarts (2-8 µs each) are really expensive in this scenario.
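To put a rough number on why the restart cost matters, here is a back-of-envelope estimate using the figures above (1_000_000 launches at 2-8 µs each):

```julia
# Launch-overhead estimate from the numbers in the post:
# ~1_000_000 kernel launches, each paying a 2-8 µs restart cost.
launches = 1_000_000
restart_us = (2, 8)                        # per-launch overhead range, µs
overhead_s = launches .* restart_us ./ 1e6 # total overhead in seconds
# i.e. on the order of 2-8 whole seconds spent purely on kernel restarts
```

That is pure overhead, independent of the actual compute, which is why keeping everything inside one kernel is attractive here.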

The algorithm is very simple but on GPU I need to sync all the calculations before the reduce_sum.

I simplified the whole algorithm to this:

```julia
using BenchmarkTools
using CUDA

cudatestSync(c, a, b, X) = begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) * size(c, 2) && return
    ca = CUDA.Const(a)
    cb = CUDA.Const(b)
    for i in 1:20
        c[I] += ca[I] + cb[I]
    end
    # Things I tried that did not give a grid-wide barrier:
    # sync_threads()
    # CUDA.device_synchronize()
    # while syncCounter < size(c, 1) * size(c, 2)
    #     if ok[I] == 0
    #         atomic_inc!(syncCounter)
    #         ok[I] += 1
    #     end
    # end
    # nanosleep(1_000_000_000)
    # threadfence()
    # threadfence_block()
    # synchronize()
    gh = this_grid()
    sync_grid(gh)
    if I <= size(c, 1)
        for j in 1:size(c, 2)
            X[I] += c[I, j]
        end
    end
    return
end

nth = 512
N = 1000
cf, af, bf = CUDA.randn(Float32, N, N), CUDA.randn(Float32, N, N), CUDA.randn(Float32, N, N)
C, A, B = copy(cf), copy(af), copy(bf); X = CUDA.zeros(Float32, N)
@time CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) cooperative=true cudatestSync(C, A, B, X)
@show sum(cf .+ 20 .* (af .+ bf); dims=2)[1:10]
@show X[1:10]
@assert all(sum(cf .+ 20 .* (af .+ bf); dims=2) .≈ X) "sync doesn't WORK!"
```

I tried many of the different methods supported by CUDA.jl. I know lots of them were naive attempts; I left them in the code as comments just to show what I tried.

I believed cooperative groups would be the way to go:

`https://developer.nvidia.com/blog/cooperative-groups/`


But I get:

`ERROR: LoadError: CUDA error: too many blocks in cooperative launch (code 720, ERROR_COOPERATIVE_LAUNCH_TOO_LARGE)`
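As far as I understand, this error means the cooperative launch requested more blocks than can be resident on the device at once (roughly SM count × blocks per SM), since grid-wide sync requires all blocks to run concurrently. A workaround I'm considering (a sketch, not verified on my hardware) is to compile the kernel with `launch=false`, query the occupancy-based limits with `launch_configuration`, cap the block count, and let each thread cover multiple elements with a grid-stride loop; the helper name `cudatestSyncStride` is mine:

```julia
using CUDA

# Grid-stride variant: each thread handles several indices, so the grid can
# stay small enough for all blocks to be co-resident (cooperative launch).
function cudatestSyncStride(c, a, b, X)
    n = size(c, 1) * size(c, 2)
    stride = gridDim().x * blockDim().x
    start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for I in start:stride:n
        for _ in 1:20
            c[I] += a[I] + b[I]
        end
    end
    sync_grid(this_grid())        # grid-wide barrier before the reduction
    for I in start:stride:size(c, 1)
        for j in 1:size(c, 2)
            X[I] += c[I, j]
        end
    end
    return
end

N = 1000
C, A, B = CUDA.randn(Float32, N, N), CUDA.randn(Float32, N, N), CUDA.randn(Float32, N, N)
X = CUDA.zeros(Float32, N)

kernel = @cuda launch=false cooperative=true cudatestSyncStride(C, A, B, X)
config = launch_configuration(kernel.fun)           # occupancy-based limits
threads = min(config.threads, 512)
blocks = min(config.blocks, cld(N * N, threads))    # cap at what fits on-device
kernel(C, A, B, X; threads, blocks, cooperative=true)
```

With `N*N = 1_000_000` and 512 threads, the original launch asks for ~1954 blocks, far more than any GPU can keep resident simultaneously, which would explain the code 720.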

So the goal is to reach global synchronization before the `reduce_sum`. How can I do that without exiting the kernel?