Hello,

I am new to julia and GPU computing, before today I didn’t care about optimisation (it was fast enough for me).

I use this kind of CUDA kernel (example for 2D diffusion using finite difference with euler forward):

```
using CUDA
function kernel_diff!(ρ_new, ρ, D, Nx, Nz)
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
if i <= Nx && j <= Nz
i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
@inbounds ρ_new[i,j] = ρ[i,j] + D*(ρ[i_,j]+ρ[ip,j]+ρ[i,jm]+ρ[i,jp]-4*ρ[i,j])
end
return nothing
end
function diffusion!(ρ, D, Nt)
Nx, Nz = size(ρ)
WrapsT = 16
Bx = ceil(Int, Nx/WrapsT)
Bz = ceil(Int, Nz/WrapsT)
block_dim = (WrapsT, WrapsT)
grid_dim = (Bx, Bz)
ρ_new = similar(ρ)
for i=1:Nt
@cuda threads = block_dim blocks = grid_dim kernel_diff!(ρ_new, ρ, D, Nx, Nz)
ρ_new = ρ
end
end
ρ = CUDA.rand(Float64, 1000, 1000)
D = 1e-3
Nt = 2e7
@CUDA.time CUDA.@sync diffusion!(ρ, D, Nt)
```

When I see the number of CPU allocations, i am wondering if i am doing things correctly.

What would I need to change to make it correct ?

Thank you for your help,

Best

PS : I know that there better way to solve diffusion problem, but I use way more complicated kernel to solve more complicated problem with complex boundary conditions. Diffusion is just an example here.