I replied earlier, but as it turned out all the benchmarking was incorrect because we weren’t calling CUDA.@sync. Both kernels and array programming are asynchronous unless there’s a copy from GPU to host, which triggers an implicit synchronize(). So I guess we were just measuring kernel launches.
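To illustrate what that means in practice, here’s a minimal sketch; the array size and element type are made up for illustration, not the ones from the original test:

using CUDA, BenchmarkTools

x_gpu = CUDA.rand(Float64, 256, 256, 256)   # made-up size/eltype for illustration

@btime x_gpu .*= 0.1                         # returns almost immediately: only the asynchronous launch is timed
@btime begin CUDA.@sync x_gpu .*= 0.1 end    # waits for the GPU work to actually finish
@btime Array(x_gpu)                          # GPU-to-host copy forces an implicit synchronize()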
With CUDA.@sync inserted prior to each testn!(), I get the following results:
@btime test2!(a) evals=10 samples=100
@btime begin CUDA.@sync test1!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test2!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test3!(a_gpu) end evals=100 samples=1000
CPU test2!(a): 15.919 ms (1 allocation: 80 bytes)
test1!(a_gpu): 2.229 ms (82 allocations: 4.56 KiB)
test2!(a_gpu): 678.646 μs (38 allocations: 3.32 KiB)
test3!(a_gpu): 630.090 μs (17 allocations: 1.03 KiB)
So in actual fact your kernel is perfectly performant.
FWIW, I would have set a launch configuration that ensured sequential memory accesses across the block. As it turns out, this makes no difference in this case (possibly because the warp size matches your thread dimensions), but here’s my (IMO slightly simpler) version anyway:
@inbounds function test4_kernel(a)
    i = threadIdx().x + 1                                       # 256 threads in x: i = 2:257, contiguous in memory
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y + 1   # 64 blocks of 4 threads: j = 2:257
    k = blockIdx().z + 1                                        # 256 blocks in z: k = 2:257
    a[i, j, k] *= 0.1
    return nothing
end
function test4!(a)
    threads = (256, 4, 1)
    blocks = (1, 64, 256)
    @cuda threads=threads blocks=blocks test4_kernel(a)
end
@btime begin CUDA.@sync test4!(a_gpu) end evals=100 samples=1000
test4!(a_gpu): 628.820 μs (17 allocations: 1.03 KiB)
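In case it helps to see how that configuration maps onto the array, here’s a quick end-to-end sketch. The 258^3 Float64 array is just my assumption for a_gpu (the kernel only needs indices 2:257 to exist in each dimension), so substitute whatever you’ve actually been using:

using CUDA

a_gpu = CUDA.ones(Float64, 258, 258, 258)   # assumed shape/eltype; the kernel touches 2:257 in each dimension

CUDA.@sync test4!(a_gpu)
# threads = (256, 4, 1) with blocks = (1, 64, 256) covers i, j, k = 2:257, and a
# warp's 32 consecutive threadIdx().x values hit consecutive column-major addresses.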