Hi there, I was trying to implement a custom sumdotproductthingy using CUDA.jl kernels when I stumbled upon this documentation PR. It stems from a discussion had not too long ago here.
I probably do not understand some fundamentals when it comes to CUDA kernel programming, so please bear with me. I have gathered my questions here and would appreciate any feedback/guidance.
 Why is sum_atomic still so much slower than CUDA.sum? There is a short note at the end of the PR referring to threads accessing out, but nothing concrete. How can sum_atomic be improved? Code from the PR link:
function run(kernel, arr)
    out = CUDA.zeros(eltype(arr))
    CUDA.@sync begin
        @cuda threads=128 blocks=1024 kernel(out, arr)
    end
    out[]
end

function sum_atomic(out, arr)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    acc = zero(eltype(out))
    for i = index:stride:length(arr)
        @inbounds acc += arr[i]
    end
    @atomic out[] += acc
    return nothing
end
@btime run(sum_atomic, arr) # 266.052 μs (46 allocations: 1.41 KiB)
@btime sum(arr) # 37.511 μs (62 allocations: 1.62 KiB)

Why is the number of threads specifically set to 128 and the number of blocks to 1024? Is this choice backed up by some fact? I too seem to get optimal results for this combination; however, isn’t the correct way to choose numblocks = ceil(Int, N/numthreads) according to the documentation? When I do this, the performance is worse than with threads=128 and blocks=1024. And shouldn’t numthreads * numblocks > length(out) hold? What am I missing?
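To check my own understanding of the indexing, here is a CPU-only sketch (no GPU involved) that replays the index:stride:length(arr) loop of sum_atomic for every simulated thread. The helper covered_indices and the array length n are my own assumptions for illustration; they are not from the PR or from CUDA.jl. If I read the loop correctly, every element is visited exactly once even when threads * blocks is smaller than the array length, because each thread simply handles several elements.

```julia
# CPU-only simulation of the kernel's indexing; no GPU required.
# `covered_indices` is a hypothetical helper, not part of CUDA.jl.
function covered_indices(n, threads, blocks)
    hits = zeros(Int, n)          # how often each element is visited
    stride = threads * blocks     # mirrors blockDim().x * gridDim().x
    for block in 1:blocks, thread in 1:threads
        index = (block - 1) * threads + thread
        for i in index:stride:n   # same range as in sum_atomic
            hits[i] += 1
        end
    end
    return hits
end

n = 1_000_000                     # assumed array length, for illustration only

# Fixed launch configuration from the PR: 131072 threads for 1_000_000 elements.
hits_fixed = covered_indices(n, 128, 1024)
println(all(==(1), hits_fixed))   # true: each element is read exactly once
println(cld(n, 128 * 1024))       # 8: the most elements any one thread handles

# Configuration from the documentation: at most one element per thread.
hits_ceil = covered_indices(n, 128, cld(n, 128))
println(all(==(1), hits_ceil))    # true
```

So, at least in this simulation, the two configurations differ only in how many elements each thread accumulates before its single atomic update, not in which elements get summed.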
Also: are complex-valued CuArrays not supported yet? I get the "kernel returns a value of type Union{}" error when initializing out as a CuArray{ComplexF32}.