In your kernels, I don’t see any threadIdx anywhere, so you’re just doing the same serial computation on 128 threads? You need to parallelize manually, e.g. map your 128x128x128 domain over the block/grid configuration and have every thread perform what’s basically a single iteration of the for loop. There’s no autoparallelizing magic going on here; just adding @cuda threads=... won’t make something parallel.
Alternatively, you can use abstractions that have this parallelization built in, such as broadcast; your last solution uses two fused broadcast kernels like that. Since this is a very simple stencil, though, you pay both some overhead from broadcast and the cost of launching two kernels instead of one. Fusing these together in an explicit CUDAnative stencil kernel should easily get you another 2x.
EDIT: gave it a quick go
julia> function kernel(f, δxf)
           Nx, Ny, Nz = size(f)
           # each thread computes a single output element:
           # threads cover one dimension, the 2D grid of blocks the other two
           k = threadIdx().x
           j = blockIdx().x
           i = blockIdx().y
           δxf[i, j, k] = δx(f, Nx, i, j, k)
           return
       end
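(δx above is the stencil operator from your own code, which isn’t shown in this snippet; to make the example self-contained you could use a placeholder like the periodic forward difference below, though your actual definition may well differ.)

# placeholder only: periodic forward difference along x, substitute your own δx
δx(f, Nx, i, j, k) = i == Nx ? f[1, j, k] - f[Nx, j, k] :
                               f[i+1, j, k] - f[i, j, k]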
julia> @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)
julia> @test yg ≈ yc
Test Passed
julia> @benchmark CuArrays.@sync @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)
BenchmarkTools.Trial:
  memory estimate:  1.31 KiB
  allocs estimate:  31
  --------------
  minimum time:     8.043 μs (0.00% GC)
  median time:      6.456 ms (0.00% GC)
  mean time:        6.266 ms (0.00% GC)
  maximum time:     6.957 ms (0.00% GC)
  --------------
  samples:          266
  evals/sample:     3
Different hardware, different characteristics. But this only uses 128 threads per block, which is heavily underusing my GPU; for trivial kernels like this you’d generally want to max out the thread count your device supports (typically 512 or 1024) and do a slightly more elaborate computation across threads/blocks to figure out the indices.
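A rough sketch of that pattern (untested; kernel_flat and the launch values are just illustrative, it reuses the placeholder δx from above, and you’d want to clamp threads to what your device actually reports):

function kernel_flat(f, δxf)
    Nx, Ny, Nz = size(f)
    # one global linear index per thread
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    idx > Nx * Ny * Nz && return
    # recover (i, j, k) from the linear index, fastest-varying index first
    i =  (idx - 1) % Nx + 1
    j = ((idx - 1) ÷ Nx) % Ny + 1
    k =  (idx - 1) ÷ (Nx * Ny) + 1
    @inbounds δxf[i, j, k] = δx(f, Nx, i, j, k)   # δx as in the placeholder above
    return
end

threads = 1024                        # assumed maximum; query the device to be sure
blocks  = cld(Nx * Ny * Nz, threads)
@cuda threads=threads blocks=blocks kernel_flat(xg, yg)

Mapping the fastest-moving index i onto consecutive threads also keeps the global memory accesses coalesced.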
Furthermore, implementing stencils efficiently has been researched heavily, so there are loads of possible improvements.