How can I improve the efficiency of the CUDA kernel?

I’m using Julia’s CUDA.jl to develop a Smoothed Particle Hydrodynamics program, and I’m wondering what methods can improve the performance of CUDA kernels. Should I use shared memory? Would using something like ParallelStencil.jl help improve performance? Thanks a lot! On an RTX 4060 Laptop GPU, custom CUDA kernels seem to perform better than array-style CUDA programming.

  1. CUDA kernel example
using CUDA
using BenchmarkTools

n = 10^7
drho = CUDA.zeros(Float32, n)
m = CUDA.rand(Float32, n)
vd = CUDA.rand(Float32, (3, n))
dw = CUDA.rand(Float32, (3, n))
source = CUDA.rand(Bool, n)

function kernel(source, drho, m, vd, dw, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x  # global 1-based thread index

    if i <= len
        drho[i] = source[i] * m[i] * (vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i])
    end

    return nothing
end

@btime CUDA.@sync begin
    @cuda threads=256 blocks=cld($n, 256) kernel($source, $drho, $m, $vd, $dw, $n)
end

The result:

1.356 ms (28 allocations: 528 bytes)

Is this the correct way to measure the actual memory throughput?

t_it = @belapsed begin
    @cuda threads=256 blocks=cld($n, 256) kernel($source, $drho, $m, $vd, $dw, $n)
    synchronize()
end
T_tot_lb = 8*1/1e9*n*sizeof(Float32)/t_it  # 8 Float32-sized accesses per particle: m, vd (×3), dw (×3), and the write to drho
# T_tot_lb = 236.26698168930892 [GB/s]
  2. CUDA array programming
@btime CUDA.@sync begin 
    drho = source .* m .* sum((vd .* dw), dims = 1)'
end

The result:

2.679 ms (298 allocations: 8.20 KiB)

Hello, since you’ve made sure the index is in bounds, add @inbounds before the if.

Also, the whole point is to reduce the number of registers, which you can check with CUDA.@profile trace=true yourfunc().
Adding @inbounds should already reduce it by a lot (from ~40 down to ~16 or so).
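
Something like this should give you that trace on a recent CUDA.jl (reusing the kernel and arrays from your first post):

# Integrated profiler; trace=true lists every kernel launch individually
CUDA.@profile trace=true begin
    @cuda threads=256 blocks=cld(n, 256) kernel(source, drho, m, vd, dw, n)
end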

Finally, I think CUDA.jl has a way to deduce the best number of threads and blocks; check its documentation for that.

You should still check whether you can do better with separate kernels; sometimes one complicated kernel is less performant than two simple ones (I mention this because of the dot products you’re doing, which CuBLAS may handle a lot better).
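
For concreteness, a rough sketch of the two-kernel variant (dot_kernel, scale_kernel and the temporary dots array are just names I made up; the extra round trip through dots costs bandwidth, so it is not guaranteed to be faster):

# First kernel: per-particle dot product into a temporary array
function dot_kernel(dots, vd, dw, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= len
        @inbounds dots[i] = vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i]
    end
    return nothing
end

# Second kernel: elementwise combination with source and m
function scale_kernel(drho, source, m, dots, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= len
        @inbounds drho[i] = source[i] * m[i] * dots[i]
    end
    return nothing
end

dots = CUDA.zeros(Float32, n)
@cuda threads=256 blocks=cld(n, 256) dot_kernel(dots, vd, dw, n)
@cuda threads=256 blocks=cld(n, 256) scale_kernel(drho, source, m, dots, n)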

Note that this

@btime CUDA.@sync begin 
    drho = source .* m .* sum((vd .* dw), dims = 1)'
end

is launching quite a lot of kernels: one for the outer broadcast, one for the inner one, two for the sum, and one for the transpose (5 kernels), plus two allocations from the sum and the transpose. Writing it with Tullio.jl, for instance, would keep it clean and launch only one kernel, but I think reductions aren’t that great with it, so it still needs checking.
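
Untested sketch of what that could look like with Tullio.jl (assuming KernelAbstractions.jl is loaded so Tullio can generate a GPU kernel; k only appears on the right-hand side, so it is summed over, and = writes into the existing drho):

using Tullio, KernelAbstractions  # KernelAbstractions is needed for Tullio's GPU path

# One fused kernel: implicit sum over k, written in place into drho
@tullio drho[i] = source[i] * m[i] * vd[k, i] * dw[k, i]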


Thank you very much for your answer! Using @inbounds and adjusting the number of threads doesn’t seem to significantly affect the runtime. I suspect this kernel is already close to its performance limit.
Thank you for suggesting the use of cuBLAS — I’ll definitely give it a try.

function kernel(source, drho, m, vd, dw, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x

    if i <= len
        @inbounds drho[i] = source[i] * m[i] * (vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i])
    end

    return nothing
end

launchkernel = @cuda launch=false kernel(source, drho, m, vd, dw, n)  # compile without launching
config = launch_configuration(launchkernel.fun)  # occupancy-based threads/blocks suggestion
threads = min(n, config.threads)
blocks = cld(n, threads)
@btime CUDA.@sync begin
    @cuda threads=$threads blocks=$blocks kernel($source, $drho, $m, $vd, $dw, $n)
end
# 1.358 ms (28 allocations: 528 bytes)

Yeah, that’s already extremely fast and probably memory bound :+1: I’m bad at theoretical max perf, do you know how to check that? Like, knowing you’re running 5 multiplications and 2 additions on 1e7 elements, what’s the theoretical max depending on the number of threads?

Hey! You can check out the Performance Metrics section in ParallelStencil.jl, and also the first lesson of the julia-gpu-course — both go over this kind of stuff.
They mainly focus on memory throughput as a performance indicator, since on GPUs, compute power usually isn’t the bottleneck anymore.
So basically, you take how much memory your kernel moves, divide it by the runtime, and that gives you the effective throughput, which you can then compare against the GPU’s theoretical bandwidth.

T_tot_lb = 8*1/1e9*n*sizeof(Float32)/t_it
# T_tot_lb = 236.26698168930892 [GB/s]
# RTX 4060 Laptop 256 [GB/s]

The same idea applies to FLOPs, but in my case the achieved FLOP rate is way lower than the theoretical max.

a = CUDA.rand(Float32, 10^7)
b = CUDA.rand(Float32, 10^7)

c = CUDA.zeros(Float32, 10^7)

t_it = @belapsed begin
    c .= a .+ b
    synchronize()
end
T_tot = 1.0 * 10^7 / t_it  # one addition per element, so this is the achieved FLOP/s
# 1.9073049780659927e10
# 4060 Laptop 14.56TFLOPS = 1.456e13