Correct utilisation of CUDA kernel for simulations

Hello,

I am new to Julia and GPU computing; before today I didn’t care about optimisation (it was fast enough for me).

I use this kind of CUDA kernel (an example for 2D diffusion with finite differences and forward Euler):

using CUDA

function kernel_diff!(ρ_new, ρ, D, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= Nx && j <= Nz
        i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)  # periodic neighbours in x (mod onto the index range)
        jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)  # periodic neighbours in z
        @inbounds ρ_new[i,j] = ρ[i,j] + D*(ρ[i_,j]+ρ[ip,j]+ρ[i,jm]+ρ[i,jp]-4*ρ[i,j])
    end
    return nothing
end

function diffusion!(ρ, D, Nt)
    Nx, Nz = size(ρ)
    WrapsT = 16
    Bx = ceil(Int, Nx/WrapsT)
    Bz = ceil(Int, Nz/WrapsT)
    block_dim = (WrapsT, WrapsT)
    grid_dim = (Bx, Bz)
    ρ_new = similar(ρ)
    for i=1:Nt
        @cuda threads = block_dim blocks = grid_dim kernel_diff!(ρ_new, ρ, D, Nx, Nz)
        ρ_new = ρ
    end
end

ρ = CUDA.rand(Float64, 1000, 1000)
D = 1e-3
Nt = 2e7

@CUDA.time CUDA.@sync diffusion!(ρ, D, Nt)

When I see the number of CPU allocations, I wonder whether I am doing things correctly.

What would I need to change to make it correct?

Thank you for your help,

Best

PS: I know there are better ways to solve the diffusion problem, but I use far more complicated kernels to solve more complicated problems with complex boundary conditions. Diffusion is just an example here.


Maybe you want copyto!(ρ, ρ_new)? And maybe just call diffusion!(ρ, D, Nt)

Is this what you want?:

using CUDA
function kernel_diff!(ρ_new, ρ, D, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= Nx && j <= Nz
        i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
        jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
        @inbounds ρ_new[i,j] = ρ[i,j] + D*(ρ[i_,j]+ρ[ip,j]+ρ[i,jm]+ρ[i,jp]-4*ρ[i,j])
    end
    return nothing
end
function diffusion!(ρ_new, ρ, D, Nt)
    Nx, Nz = size(ρ)
    gpukernel = @cuda launch=false kernel_diff!(ρ_new, ρ, D, Nx, Nz)
    config = launch_configuration(gpukernel.fun)
    maxThreads = config.threads
    Tx  = min(maxThreads, Nx)
    Ty  = min(fld(maxThreads, Tx), Nz)
    Bx, By = cld(Nx, Tx), cld(Nz, Ty)  # Blocks in grid.
    threads = (Tx, Ty)
    blocks  = Bx, By
    for i=1:Nt
        CUDA.@sync gpukernel(ρ_new, ρ, D, Nx, Nz; threads = threads, blocks = blocks)
        copyto!(ρ, ρ_new)
    end
    ρ_new
end

ρ = CUDA.rand(Float64, 1000, 1000)
ρ_new = similar(ρ)
D = 1e-3
Nt = 10000

using BenchmarkTools

@benchmark diffusion!($ρ_new, $ρ, $D, $Nt)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.158 s …    3.464 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.311 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.311 s ± 216.491 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.16 s         Histogram: frequency by time         3.46 s <

 Memory estimate: 44.20 MiB, allocs estimate: 720281.

Thank you for your reply,

If I use copyto!(), it takes longer:

4.956804 seconds (892.41 k CPU allocations: 47.940 MiB, 0.13% gc time) (1 GPU allocation: 7.629 MiB, 0.00% memmgmt time)

(with ρ_new = ρ:

3.071523 seconds (890.81 k CPU allocations: 47.846 MiB, 0.21% gc time) (1 GPU allocation: 7.629 MiB, 0.00% memmgmt time)

)

In fact I am a bit lost by the number of ways to do ρ_new = ρ, which take very different times…

If I write a kernel to do ρ_new = ρ:

function kernel_update!(ρ, ρ_new, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= Nx && j <= Nz
        @inbounds ρ[i,j] = ρ_new[i,j]
    end
    return nothing
end

it takes less time than with copyto!() but I have more CPU allocations:

4.751879 seconds (1.73 M CPU allocations: 91.440 MiB, 0.36% gc time) (1 GPU allocation: 7.629 MiB, 0.00% memmgmt time)

Hi! Of course, because ρ_new = ρ really does nothing. If you want to modify the array in place, you don’t need ρ_new; if you want a separate array, you should use copy or copyto!. You can decrease allocations by putting the simulation cycle inside the kernel.
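
To make the difference concrete, here is a minimal sketch: rebinding only changes which array a name refers to, while copyto! actually copies the data.

using CUDA

a = CUDA.zeros(Float64, 4)
b = CUDA.ones(Float64, 4)

# Rebinding: the name `a` now refers to the same array as `b`;
# nothing is copied and the array previously behind `a` is simply dropped.
a = b

# Copying: the contents of `b` are written into the existing array `a`;
# the two names keep referring to distinct buffers.
a = CUDA.zeros(Float64, 4)
copyto!(a, b)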

probably

ρ, ρ_new = ρ_new, ρ

is the most efficient way?
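
A sketch of what that swap looks like in the time loop, reusing kernel_diff! from the posts above (the function name diffusion_swap! is just for illustration):

function diffusion_swap!(ρ_new, ρ, D, Nt)
    Nx, Nz = size(ρ)
    threads = (16, 16)
    blocks  = (cld(Nx, 16), cld(Nz, 16))
    for _ in 1:Nt
        @cuda threads=threads blocks=blocks kernel_diff!(ρ_new, ρ, D, Nx, Nz)
        # Swap the bindings instead of copying: the next iteration reads
        # from the array that was just written; no GPU data is moved.
        ρ, ρ_new = ρ_new, ρ
    end
    return ρ  # the most recently written buffer
end

Since the swap only changes local bindings, the caller should use the returned array: depending on the parity of Nt, the final state lives in what the caller knows as ρ or as ρ_new.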

You’re calling your kernel lots of times within a for loop.
Maybe compare that to another kernel that contains that loop and is called just a single time.
Alternatively, you can have a look at the graph execution API to speed up calling the kernel in a loop. I don’t really have experience with this though, so I’m not sure whether it applies to your case.
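
For reference, a minimal sketch of that graph API as described in the CUDA.jl release notes (treat the macro and its exact behaviour as an assumption to check against your CUDA.jl version): CUDA.@captured records the GPU work of one iteration into a graph on the first pass and replays the cached graph afterwards, which removes most of the per-launch overhead. Note that a captured graph refers to fixed arrays, so the buffer-swapping pattern above does not map onto it directly.

using CUDA

A = CUDA.zeros(Float64, 1000, 1000)

# Warm-up launch so the broadcast kernel is already compiled
# before the stream capture starts.
A .+= 1

for step in 1:1000
    # First iteration: record the work into a CUDA graph.
    # Later iterations: replay the cached graph instead of re-launching.
    CUDA.@captured A .+= 1
end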


this is the best solution :slight_smile:
so, I think that allocations are not a problem - GC takes 0.36% of the time


It’s true ^^, thank you :slight_smile:

The problem is that I need to have the entire array updated before going to the next step.

Thank you, I’ll try soon

sure, but that doesn’t prevent you from moving the time-step loop into the kernel. You just need to make sure to place proper synchronization barriers in each iteration.
That should at least save you the kernel call overhead.
Not sure whether that will result in any speedup without proper profiling, but it should be a quick thing to try…
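
For completeness, a rough sketch of what an in-kernel time loop could look like. It assumes CUDA.jl’s cooperative-groups support (the CUDA.CG submodule; exact names may differ between versions) plus a cooperative launch, and cooperative launches limit how many blocks can be resident at once, so take it as a direction to explore rather than a drop-in solution:

using CUDA
using CUDA: CG

function kernel_diff_loop!(ρ_new, ρ, D, Nx, Nz, Nt)
    grid = CG.this_grid()
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    for _ in 1:Nt
        if i <= Nx && j <= Nz
            i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
            jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
            @inbounds ρ_new[i,j] = ρ[i,j] + D*(ρ[i_,j]+ρ[ip,j]+ρ[i,jm]+ρ[i,jp]-4*ρ[i,j])
        end
        # Grid-wide barrier: no thread starts the next step before every
        # block has finished writing the current one.
        CG.sync(grid)
        # Swap the local bindings so the next step reads what was just written.
        ρ, ρ_new = ρ_new, ρ
    end
    return nothing
end

# The kernel must be launched cooperatively for the grid-wide sync to be legal, e.g.
# @cuda cooperative=true threads=(16,16) blocks=(cld(Nx,16), cld(Nz,16)) kernel_diff_loop!(ρ_new, ρ, D, Nx, Nz, Nt)

As with the swap on the host side, after an odd number of steps the result lives in the array passed as ρ_new, after an even number in ρ.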

You could try Stencils.jl for CUDA diffusions… all this is worked out for you and pretty fast.

Thank you, but in reality I don’t do diffusion.
I already checked Stencils.jl, but it is not suitable for what I am doing.

But… you’re running some kind of stencil function on the GPU?

I’m not sure why it wouldn’t help?

I’ll check; I have never tried to synchronize inside a kernel. I usually call kernels step by step.
Typically, I perform a lot of operations within a single kernel.

You can just use stencils for the window part; you don’t need to use the whole mapstencil.

I’m suggesting it because reading a static stencil from an array will be an order of magnitude faster than your loops/ranges with runtime-valued sizes.

Sorry, I made a mistake; I confused it with another package (ParallelStencil.jl).
There is no real documentation for Stencils.jl, but it seems that it is not suitable when I have multiple fields to update in parallel?
In this example I only have the density, but usually in one kernel I update many fields (and not always the same ones).

If I want to synchronize inside a kernel, I can only synchronize the threads of one block, so do I have to use a single block? I was thinking that is impossible for a large array.

If the code is not independent across blocks (as is the case for your diffusion example), then this might indeed become difficult.
In that case I think you should first figure out where the bottleneck actually is and do some profiling.
There’s a great tutorial on how to do that by @maleadt here: GitHub - maleadt/cscs2023 (especially 1-3 and 1-4)
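
As a quick first pass before reaching for Nsight, CUDA.jl’s integrated CUDA.@profile macro (in recent versions it prints a summary of host and device activity; older versions use it to mark a region for an external profiler) already shows whether launch overhead or the kernels themselves dominate:

using CUDA

# Profile a short run; a few hundred steps is usually enough to see
# where the time goes (host-side launches vs. device-side kernels).
CUDA.@profile diffusion!(ρ_new, ρ, D, 100)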