# Correct utilisation of CUDA kernel for simulations

Hello,

I am new to julia and GPU computing, before today I didn’t care about optimisation (it was fast enough for me).

I use this kind of CUDA kernel (example for 2D diffusion using finite difference with euler forward):

``````using CUDA

function kernel_diff!(ρ_new, ρ, D, Nx, Nz)
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
if i <= Nx && j <= Nz
i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
@inbounds ρ_new[i,j] = ρ[i,j] + D*(ρ[i_,j]+ρ[ip,j]+ρ[i,jm]+ρ[i,jp]-4*ρ[i,j])
end
return nothing
end

function diffusion!(ρ, D, Nt)
Nx, Nz = size(ρ)
WrapsT = 16
Bx = ceil(Int, Nx/WrapsT)
Bz = ceil(Int, Nz/WrapsT)
block_dim = (WrapsT, WrapsT)
grid_dim = (Bx, Bz)
ρ_new = similar(ρ)
for i=1:Nt
@cuda threads = block_dim blocks = grid_dim kernel_diff!(ρ_new, ρ, D, Nx, Nz)
ρ_new = ρ
end
end

ρ = CUDA.rand(Float64, 1000, 1000)
D = 1e-3
Nt = 2e7

@CUDA.time CUDA.@sync diffusion!(ρ, D, Nt)
``````

When I see the number of CPU allocations, i am wondering if i am doing things correctly.

What would I need to change to make it correct ?

Best

PS : I know that there better way to solve diffusion problem, but I use way more complicated kernel to solve more complicated problem with complex boundary conditions. Diffusion is just an example here.

1 Like

Maybe you want `copyto!(ρ, ρ_new)`? And maybe just call `diffusion!(ρ, D, Nt)`

Is this what you want?:

``````using CUDA
function kernel_diff!(ρ_new, ρ, D, Nx, Nz)
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
if i <= Nx && j <= Nz
i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
@inbounds ρ_new[i,j] = ρ[i,j] + D*(ρ[i_,j]+ρ[ip,j]+ρ[i,jm]+ρ[i,jp]-4*ρ[i,j])
end
return nothing
end
function diffusion!(ρ_new, ρ, D, Nt)
Nx, Nz = size(ρ)
gpukernel = @cuda launch=false kernel_diff!(ρ_new, ρ, D, Nx, Nz)
config = launch_configuration(gpukernel.fun)
Bx, By = cld(Nx, Tx), cld(Nz, Ty)  # Blocks in grid.
blocks  = Bx, By
for i=1:Nt
CUDA.@sync gpukernel(ρ_new, ρ, D, Nx, Nz; threads = threads, blocks = blocks)
copyto!(ρ, ρ_new)
end
ρ_new
end

ρ = CUDA.rand(Float64, 1000, 1000)
ρ_new = similar(ρ)
D = 1e-3
Nt = 10000

using BenchmarkTools

@benchmark diffusion!(\$ρ_new, \$ρ, \$D, \$Nt)
``````
``````BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max):  3.158 s …    3.464 s  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     3.311 s               ┊ GC (median):    0.00%
Time  (mean ± σ):   3.311 s ± 216.491 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                                                        █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
3.16 s         Histogram: frequency by time         3.46 s <

Memory estimate: 44.20 MiB, allocs estimate: 720281.
``````
2 Likes

If I use `copyto!()` it takes longer

4.956804 seconds (892.41 k CPU allocations: 47.940 MiB, 0.13% gc time) (1 GPU allocation: 7.629 MiB, 0.00% memmgmt time)

(with `ρ_new = ρ`:

3.071523 seconds (890.81 k CPU allocations: 47.846 MiB, 0.21% gc time) (1 GPU allocation: 7.629 MiB, 0.00% memmgmt time)

)

In fact I am a bit lost when I see the number of ways to do `ρ_new = ρ` that take very different times…

If I write a kernel to do `ρ_new = ρ`:

``````function kernel_update!(ρ, ρ_new, Nx, Nz)
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
if i <= Nx && j <= Nz
@inbounds ρ[i,j] = ρ_new[i,j]
end
return nothing
end
``````

it takes less time than with `copyto!()` but I have more CPU allocations:

4.751879 seconds (1.73 M CPU allocations: 91.440 MiB, 0.36% gc time) (1 GPU allocation: 7.629 MiB, 0.00% memmgmt time)

Hi! Of course, because ‘rho_new = rho’ really do nothing. If you want to make changes in array - you don’t need rho_new, if you want to get another array you should use copy or copyto! You can decrease allocations if put sim cycle in kernel.

probably

``````ρ, ρ_new = ρ_new, ρ
``````

is the most efficient way?

You’re calling your kernel lots of times within a for loop.
Maybe compare that to another kernel that contains that loop and is called just a single time.
Alternatively, you can maybe have a look at the graph execution API to speed calling the kernel in a loop. I don’t really have experience with this though, so not sure if your case applies.

1 Like

this is the best solution
so, I think that allocation not a problem - GC takes 0.36% of time

1 Like

It’s true ^^, thank you

The problem is that I need to have the entire array updated before going to the next step.

Thank you, I’ll try soon

sure, but that doesn’t prevent you from moving the time-step-loop into the kernel. You just need to make sure to place proper synchronization barriers in each iteration.
That should at least save you the kernel call overhead.
Not sure whether that will result in any speedup without a proper profiling, but it should be a quick thing to try…

You could try Stencils.jl for CUDA diffusions… all this is worked out for you and pretty fast.

Thank you, but in reality I don’t do diffusion.
I already checked Stencils.jl but it is not suitable for what I am doing

But… your running some kind of function of a stencil on GPU?

I’m not sure why it wouldn’t help?

I’ll check, I never tried to synchronize inside a kernel. I usualy call kernel step by step.
Typically, I perform a lot of operations within a single kernel.

You can just use stencils for the window part, you don’t need to use the whole `mapstencil`

I’m suggesting it because reading a static stencil from an array will be an order of magnitude faster than your loop/ranges with runtime valued size

Sorry, I made a mistake, I confused it with another package (ParallelStencil.jl).
There is no real documentation for Stencils.jl, but it seems that it is not suitable when I have multiple field to update in parallel ?
In this example I have only the density but usually in one kernel I update many fields (and not always the same).

If I want to synchronize inside a kernel I can synchronize only the threads of one block, then do have I to use only one block ? I was thinking it is impossible for large array

If the code is not independent across blocks (as is the case for your diffusion example), then this might indeed become difficult.
In that case I think you should first figure out where the bottleneck actually is and do some profiling.
There’s a great tutorial on how to do that by @maleadt here: GitHub - maleadt/cscs2023 (especially 1-3 and 1-4)