I’m using Julia’s CUDA.jl to develop a Smoothed Particle Hydrodynamics (SPH) program, and I’m wondering which methods can improve the performance of my CUDA kernels. Should I use shared memory? Would something like ParallelStencil.jl help? Thanks a lot! On an RTX 4060 Laptop GPU, custom CUDA kernels seem to perform better than array-style CUDA programming.
- CUDA kernel example
using CUDA
using BenchmarkTools

n = 10^7
drho = CUDA.zeros(Float32, n)     # output array
m = CUDA.rand(Float32, n)
vd = CUDA.rand(Float32, (3, n))   # 3 × n
dw = CUDA.rand(Float32, (3, n))   # 3 × n
source = CUDA.rand(Bool, n)       # Bool mask
function kernel(source, drho, m, vd, dw, len)
    # global 1-based thread index
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= len
        # dot product of the i-th columns of vd and dw, masked by source[i]
        drho[i] = source[i] * m[i] * (vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i])
    end
    return nothing
end
@btime CUDA.@sync begin
    @cuda threads=256 blocks=cld($n, 256) kernel($source, $drho, $m, $vd, $dw, $n)
end
The result:
1.356 ms (28 allocations: 528 bytes)
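As an aside, instead of hard-coding threads=256, CUDA.jl's occupancy API can suggest a launch configuration; a minimal sketch for the same kernel (untimed here), following the pattern from the CUDA.jl documentation:

k = @cuda launch=false kernel(source, drho, m, vd, dw, n)
config = launch_configuration(k.fun)   # occupancy-based suggestion
threads = min(n, config.threads)
blocks = cld(n, threads)
k(source, drho, m, vd, dw, n; threads, blocks)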
Is this the correct way to calculate the achieved (effective) memory throughput?
t_it = @belapsed begin
    @cuda threads=256 blocks=cld($n, 256) kernel($source, $drho, $m, $vd, $dw, $n)
    synchronize()
end
# 8 Float32 accesses per element: read m, 3 × vd, 3 × dw, write drho
T_tot_lb = 8 * 1/1e9 * n * sizeof(Float32) / t_it
# T_tot_lb = 236.26698168930892 [GB/s]
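The factor of 8 counts only the Float32 traffic and ignores the read of source, which adds one byte per element (sizeof(Bool) == 1). A slightly stricter count with the same timing would be:

bytes_per_elem = 8 * sizeof(Float32) + sizeof(Bool)   # 33 bytes
T_tot = bytes_per_elem * n / 1e9 / t_it               # [GB/s]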
- CUDA array programming
@btime CUDA.@sync begin
    # allocates a fresh drho plus intermediates (vd .* dw and the sum) on every call
    drho = source .* m .* sum(vd .* dw, dims = 1)'
end
The result:
2.679 ms (298 allocations: 8.20 KiB)
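Part of the gap seems to come from those temporaries. A single fused, in-place broadcast over column views launches just one kernel, much like the handwritten version; a sketch, assuming the arrays defined above:

drho .= source .* m .* (view(vd, 1, :) .* view(dw, 1, :) .+
                        view(vd, 2, :) .* view(dw, 2, :) .+
                        view(vd, 3, :) .* view(dw, 3, :))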