I wrote a small test program to learn a bit more about CUDA.jl. I find it hard to understand why the test3! function is so much slower than test1! and test2!, especially because this is the way one would write a CUDA kernel in C++/CUDA. Also, why is test1! slower than test2!?
I replied earlier, but as it turned out all the benchmarking was incorrect because we weren't calling CUDA.@sync. Both kernels and array programming are asynchronous, unless there's a copy from GPU to host, which triggers an implicit synchronize(). So I guess we were just measuring kernel launches.
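For example (a sketch, with test1! and a_gpu as defined in the original post):

using CUDA, BenchmarkTools

@btime test1!($a_gpu)            # only times the asynchronous kernel launch
@btime CUDA.@sync test1!($a_gpu) # waits for the kernel to actually finish
# A device-to-host copy such as Array(a_gpu) also forces an implicit synchronize()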
With CUDA.@sync inserted prior to each testn!(), I get the following results:
@btime test2!(a) evals=10 samples=100
@btime begin CUDA.@sync test1!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test2!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test3!(a_gpu) end evals=100 samples=1000
CPU test2!(a): 15.919 ms (1 allocation: 80 bytes)
test1!(a_gpu) 2.229 ms (82 allocations: 4.56 KiB)
test2!(a_gpu) 678.646 μs (38 allocations: 3.32 KiB)
test3!(a_gpu) 630.090 μs (17 allocations: 1.03 KiB)
So in actual fact your kernel is perfectly performant.
FWIW, I would have set a launch configuration that ensured consecutive threads in a block make sequential (coalesced) accesses to memory. As it turns out this doesn't make any difference in this case (possibly because the warp size matches your thread dimensions), but here's my (IMO slightly simpler) version anyway:
@inbounds function test4_kernel(a)
    # threadIdx().x indexes the first array dimension, which is the fastest-varying
    # one in Julia's column-major layout, so consecutive threads in a warp access
    # consecutive memory addresses
    i = threadIdx().x + 1
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y + 1
    k = blockIdx().z + 1
    a[i, j, k] *= 0.1
    return nothing
end

function test4!(a)
    threads = (256, 4, 1)
    blocks = (1, 64, 256)
    @cuda threads=threads blocks=blocks test4_kernel(a)
end
@btime begin CUDA.@sync test4!(a_gpu) end evals=100 samples=1000
test4!() 628.820 μs (17 allocations: 1.03 KiB)
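If you'd rather not hand-pick the thread and block counts, CUDA.jl's occupancy API can suggest a launch configuration (a sketch; you still have to map the suggested totals onto your 3-D index ranges yourself):

kernel = @cuda launch=false test4_kernel(a_gpu)
config = launch_configuration(kernel.fun)  # suggested threads per block and number of blocks
@show config.threads config.blocks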
It’s a good question. Normal Julia expands a[m:n] .*= b to a[m:n] .= a[m:n] .* b. Julia fuses both of those dot operations during broadcasting into a single loop over the indices m:n.
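Roughly speaking, the fusion means the multiply and the assignment happen in one pass, with no intermediate array for the product (a sketch using the same placeholder names a, b, m and n, ignoring bounds checks):

for idx in m:n
    a[idx] = a[idx] * b
end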
I seem to recall the CUDA docs once explicitly stated that CUDA.jl does not fuse dot operations, which would mean the two operations run in separate stages, creating an intermediate array.
I’m doubting myself here, so perhaps someone more knowledgeable can answer.
I would caution that BenchmarkTools does not capture GPU memory allocations, but in this case we can see from the lowering that it indeed does not make a copy:
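One way to check on the CPU side is to look at the lowering (a sketch; the exact SSA output varies between Julia versions):

Meta.@lower a[m:n] .*= b
# The destination lowers to Base.dotview(a, m:n), i.e. a view, and
# Base.materialize!(...) then writes the broadcast result into that view in place.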