How can I improve the efficiency of the CUDA kernel?

I’m using Julia’s CUDA.jl to develop a Smoothed Particle Hydrodynamics program, and I’m wondering what methods can improve the performance of CUDA kernels. Should I use shared memory? Would using something like ParallelStencil.jl help improve performance? Thanks a lot! On an RTX 4060 Laptop GPU, custom CUDA kernels seem to perform better than array-style CUDA programming.

  1. CUDA kernel example
using CUDA
using BenchmarkTools

n = 10^7
drho = CUDA.zeros(Float32, n)
m = CUDA.rand(Float32, n)
vd = CUDA.rand(Float32, (3, n))
dw = CUDA.rand(Float32, (3, n))
source = CUDA.rand(Bool, n)

function kernel(source, drho, m, vd, dw, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x  # global 1-based thread index

    if i <= len
        drho[i] = source[i] * m[i] * (vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i])
    end

    return nothing
end

@btime CUDA.@sync begin
    @cuda threads=256 blocks=cld($n, 256) kernel($source, $drho, $m, $vd, $dw, $n)
end

The result:

1.356 ms (28 allocations: 528 bytes)

Is this the correct way to measure the actual memory throughput?

t_it = @belapsed begin
    @cuda threads=256 blocks=cld($n, 256) kernel($source, $drho, $m, $vd, $dw, $n)
    synchronize()
end
T_tot_lb = 8*1/1e9*n*sizeof(Float32)/t_it  # 8 Float32-sized accesses per particle: m, vd (×3), dw (×3), and the write to drho
# T_tot_lb = 236.26698168930892 [GB/s]
  2. CUDA array programming
@btime CUDA.@sync begin 
    drho = source .* m .* sum((vd .* dw), dims = 1)'
end

The result:

2.679 ms (298 allocations: 8.20 KiB)

Hello, since you’ve made sure the index is in bounds, add @inbounds before the if.

Also, the whole point is to reduce the number of registers, which you can check with CUDA.@profile trace=true yourfunc().
Adding @inbounds should already reduce it by a lot (from ~40 down to ~16 or so).
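
Something like this should give you that trace on a recent CUDA.jl (reusing the kernel and arrays from your first post):

# Integrated profiler; trace=true lists every kernel launch individually
CUDA.@profile trace=true begin
    @cuda threads=256 blocks=cld(n, 256) kernel(source, drho, m, vd, dw, n)
end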

Finally, I think CUDA.jl has a way to deduce the best number of threads and blocks; check its documentation for that.

You should still check whether you can do better with separate kernels; sometimes one complicated kernel is less performant than two simple ones (I mention this because of the dot products you’re doing, which CuBLAS may handle a lot better).
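
For concreteness, a rough sketch of the two-kernel variant (dot_kernel, scale_kernel and the temporary dots array are just names I made up; the extra round trip through dots costs bandwidth, so it is not guaranteed to be faster):

# First kernel: per-particle dot product into a temporary array
function dot_kernel(dots, vd, dw, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= len
        @inbounds dots[i] = vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i]
    end
    return nothing
end

# Second kernel: elementwise combination with source and m
function scale_kernel(drho, source, m, dots, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= len
        @inbounds drho[i] = source[i] * m[i] * dots[i]
    end
    return nothing
end

dots = CUDA.zeros(Float32, n)
@cuda threads=256 blocks=cld(n, 256) dot_kernel(dots, vd, dw, n)
@cuda threads=256 blocks=cld(n, 256) scale_kernel(drho, source, m, dots, n)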

Note that this

@btime CUDA.@sync begin 
    drho = source .* m .* sum((vd .* dw), dims = 1)'
end

is launching quite a lot of kernels: one for the outer broadcast, one for the inner one, two for the sum, and one for the transpose (5 kernels), plus two allocations from the sum and the transpose. Writing it with Tullio.jl, for instance, would keep it clean and launch only one kernel, but I think reductions aren’t that great with it, so it still needs checking.
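
Untested sketch of what that could look like with Tullio.jl (assuming KernelAbstractions.jl is loaded so Tullio can generate a GPU kernel; k only appears on the right-hand side, so it is summed over, and = writes into the existing drho):

using Tullio, KernelAbstractions  # KernelAbstractions is needed for Tullio's GPU path

# One fused kernel: implicit sum over k, written in place into drho
@tullio drho[i] = source[i] * m[i] * vd[k, i] * dw[k, i]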


Thank you very much for your answer! Using @inbounds and adjusting the number of threads doesn’t seem to significantly affect the runtime. I suspect this kernel is already close to its performance limit.
Thank you for suggesting the use of cuBLAS — I’ll definitely give it a try.

function kernel(source, drho, m, vd, dw, len)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x

    if i <= len
        @inbounds drho[i] = source[i] * m[i] * (vd[1, i] * dw[1, i] + vd[2, i] * dw[2, i] + vd[3, i] * dw[3, i])
    end

    return nothing
end

launchkernel = @cuda launch=false kernel(source, drho, m, vd, dw, n)  # compile without launching
config = launch_configuration(launchkernel.fun)  # occupancy-based threads/blocks suggestion
threads = min(n, config.threads)
blocks = cld(n, threads)
@btime CUDA.@sync begin
    @cuda threads=$threads blocks=$blocks kernel($source, $drho, $m, $vd, $dw, $n)
end
# 1.358 ms (28 allocations: 528 bytes)

Yeah, that’s already extremely fast and probably memory bound :+1: I’m bad at theoretical max perf, do you know how to check that? Like, knowing you’re running 5 multiplications and 2 additions on 1e7 elements, what’s the theoretical max depending on the number of threads?

Hey! You can check out the Performance Metrics section in ParallelStencil.jl, and also the first lesson of the julia-gpu-course — both go over this kind of stuff.
They mainly focus on memory throughput as a performance indicator, since on GPUs, compute power usually isn’t the bottleneck anymore.
So basically, you take how much memory your kernel moves, divide it by the runtime, and that gives you the effective throughput, which you can then compare against the GPU’s theoretical bandwidth.

T_tot_lb = 8*1/1e9*n*sizeof(Float32)/t_it
# T_tot_lb = 236.26698168930892 [GB/s]
# RTX 4060 Laptop 256 [GB/s]

The same idea applies to FLOPs, but in my case the achieved FLOP rate is way lower than the theoretical max.

a = CUDA.rand(Float32, 10^7)
b = CUDA.rand(Float32, 10^7)

c = CUDA.zeros(Float32, 10^7)

t_it = @belapsed begin
    c .= a .+ b
    synchronize()
end
T_tot = 1.0 * 10^7 / t_it  # one addition per element, so this is the achieved FLOP/s
# 1.9073049780659927e10
# 4060 Laptop 14.56TFLOPS = 1.456e13