How to use `@atomic` with CUDA?

I am trying to get `@atomic` to work. I am reducing the problem to an MWE, so the code by itself doesn't make much sense.

I have a buffer to which I am trying to add 1.0 for every thread in every block that runs.

```julia
using CUDA
CUDA.allowscalar(false)

function atleast2_gpu_v1!(buffer)
    i = threadIdx().x

    j = 0.0
    @atomic buffer[i] = +(buffer[i], j)
    # @atomic buffer[i] = buffer[i] + j # this also doesn't work

    return
end

threads = 256

buffer = CUDA.zeros(Float32, threads)
blocks = 1_000_000 ÷ threads
@device_code_warntype @cuda threads=threads blocks=blocks atleast2_gpu_v1!(buffer)
```

and I get this error complaining about the kernel returning a `Union{}`. But I clearly return nothing, so it's really mysterious what's going on:

```
PTX CompilerJob of kernel atleast2_gpu_v1!(CuDeviceArray{Float32,1,1}) for sm_75

Variables
  #self#::Core.Compiler.Const(atleast2_gpu_v1!, false)
  buffer::CuDeviceArray{Float32,1,1}
  i::Int64
  j::Float64

Body::Union{}
1 ─ %1 = Main.threadIdx()::NamedTuple{(:x, :y, :z),Tuple{Int64,Int64,Int64}}
│        (i = Base.getproperty(%1, :x))
│        (j = 0.0)
│   %4 = Core.tuple(i)::Tuple{Int64}
│        (CUDA.atomic_arrayset)(buffer, %4, Main.:+, j::Core.Compiler.Const(0.0, false))
└──      Core.Compiler.Const(:(return), false)
ERROR: LoadError: GPU compilation of kernel atleast2_gpu_v1!(CuDeviceArray{Float32,1,1}) failed
KernelError: kernel returns a value of type `Union{}`

Make sure your kernel function ends in `return`, `return nothing` or `nothing`. If the returned value is of type `Union{}`, your Julia code probably throws an exception.
Inspect the code with `@device_code_warntype` for more details.

Stacktrace:
 [1] check_method(::GPUCompiler.CompilerJob) at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\validation.jl:18
 [2] macro expansion at C:\Users\RTX2080\.julia\packages\TimerOutputs\dVnaw\src\TimerOutput.jl:206 [inlined]
 [3] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\driver.jl:63
 [4] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\driver.jl:39
 [5] compile at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\driver.jl:35 [inlined]
 [6] _cufunction(::GPUCompiler.FunctionSpec{typeof(atleast2_gpu_v1!),Tuple{CuDeviceArray{Float32,1,1}}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\RTX2080\.julia\packages\CUDA\1DBvk\src\compiler\execution.jl:311
 [7] _cufunction at C:\Users\RTX2080\.julia\packages\CUDA\1DBvk\src\compiler\execution.jl:305 [inlined]
 [8] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(atleast2_gpu_v1!),Tuple{CuDeviceArray{Float32,1,1}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\cache.jl:24
 [9] atleast2_gpu_v1! at c:\Users\RTX2080\AppData\Roaming\Code\User\globalStorage\buenon.scratchpads\scratchpads\2a695470f16de4fbb367ab34cdcda714\scratch80..jl:9 [inlined]
 [10] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(atleast2_gpu_v1!),Tuple{CuDeviceArray{Float32,1,1}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\cache.jl:0
 [11] cached_compilation at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\cache.jl:40 [inlined]
 [12] cufunction(::typeof(atleast2_gpu_v1!), ::Type{Tuple{CuDeviceArray{Float32,1,1}}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\RTX2080\.julia\packages\CUDA\1DBvk\src\compiler\execution.jl:299
 [13] cufunction(::typeof(atleast2_gpu_v1!), ::Type{Tuple{CuDeviceArray{Float32,1,1}}}) at C:\Users\RTX2080\.julia\packages\CUDA\1DBvk\src\compiler\execution.jl:294
 [14] top-level scope at C:\Users\RTX2080\.julia\packages\CUDA\1DBvk\src\compiler\execution.jl:109
 [15] top-level scope at C:\Users\RTX2080\.julia\packages\GPUCompiler\5xT46\src\reflection.jl:144
 [16] include_string(::Function, ::Module, ::String, ::String) at .\loading.jl:1088
in expression starting at c:\Users\RTX2080\AppData\Roaming\Code\User\globalStorage\buenon.scratchpads\scratchpads\2a695470f16de4fbb367ab34cdcda714\scratch80..jl:40
```
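The `Body::Union{}` in the typed code is the actual hint: an inferred return type of `Union{}` means the body can never return normally, i.e. it always throws. The same inference behaviour is easy to reproduce on the CPU (a minimal illustration, not code from the original post):

```julia
# A function whose body unconditionally throws infers as `Union{}`,
# which is exactly what the kernel error is complaining about:
alwaysthrows() = error("boom")
@code_warntype alwaysthrows()   # prints `Body::Union{}`
```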

`j = Float32(0.0)` fixes it: `buffer` holds `Float32` values while `j` was a `Float64`, and there is no atomic add that mixes the two types, so the `@atomic` call was the part of the kernel that always threw.
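For completeness, here is the kernel with the fix applied plus a quick host-side check (a minimal sketch; the `_v2` name, the `1.0f0` increment, and the final assertion are my additions, not from the original post):

```julia
using CUDA
CUDA.allowscalar(false)

# Same kernel, but `j` now matches eltype(buffer), so `@atomic`
# can lower to a hardware Float32 atomic add.
function atleast2_gpu_v2!(buffer)
    i = threadIdx().x
    j = 1.0f0                  # Float32 literal, matching the buffer
    @atomic buffer[i] += j     # shorthand for @atomic buffer[i] = buffer[i] + j
    return
end

threads = 256
blocks = 1_000_000 ÷ threads
buffer = CUDA.zeros(Float32, threads)
@cuda threads=threads blocks=blocks atleast2_gpu_v2!(buffer)

# Every block runs one thread for each x in 1:256, so each slot
# should have been incremented `blocks` times.
@assert all(Array(buffer) .== Float32(blocks))
```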