Simulation running well on a K620 but unstable on an A100?

Hello,

I am facing a difficult situation.
I am new to GPU computing. I debug my code on my computer's GPU (an old K620) and then run the simulation on a cluster with A100 cards.

The simulation is stable for a large number of time-steps on the K620 (at least 1e6 time-steps), but on the A100 it gives very different results with obvious numerical artifacts (even after 5,000 time-steps). The code is identical.

The differences are:

  1. Julia 1.6.1 on my computer and 1.6.2 on the cluster.
  2. I use JupyterLab on my Windows computer and Julia directly on the Linux cluster.
  3. CUDA compute capability 5.0 on my computer and 8.0 on the cluster.

At first I suspected a synchronization issue between threads, but after putting CUDA.@sync and synchronize() everywhere, nothing changed…
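
For reference, this is roughly how I call them (a minimal sketch with a dummy kernel, not my actual code):

using CUDA

function add_one!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] += 1
    end
    return nothing
end

a = CUDA.zeros(Float64, 1024)
# wait on the host until the launched kernel has finished
CUDA.@sync @cuda threads = 256 blocks = 4 add_one!(a)
# or, equivalently
@cuda threads = 256 blocks = 4 add_one!(a)
synchronize()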

If I type CUDA.versioninfo()
I get this on my computer:

CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.3.0
Libraries:

  • CUBLAS: 11.5.4
  • CURAND: 10.2.5
  • CUFFT: 10.5.1
  • CUSOLVER: 11.2.0
  • CUSPARSE: 11.6.0
  • CUPTI: 14.0.0
  • NVML: missing
  • CUDNN: 8.20.2 (for CUDA 11.4.0)
  • CUTENSOR: 1.3.0 (for CUDA 11.2.0)
Toolchain:

  • Julia: 1.6.1
  • LLVM: 11.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
  • Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:

  0: NVIDIA Quadro K620 (sm_50, 1.644 GiB / 2.000 GiB available)

For the cluster, I am still waiting for access (I am on the waiting list), so I cannot post its output yet.

It might still be a synchronization issue, because I do not fully understand how synchronization works.

Thank you in advance for your help.
If you need more information, I will try to provide it.

Best,

Ludovic

Some general advice: It is very hard to help you without seeing any code. If your simulation code is too large, try to extract a MWE (minimal working example) that shows the issue.

Did you try to run it on only one A100?

If yes, check the precision (the A100 can use TF32; see What is the TensorFloat-32 Precision Format? | NVIDIA Blog). Try to turn it off [if it's on] and force full FP32 mode for your calculation.

I run the simulation on only one A100 (I should have specified that).

As it is for hydrodynamic simulations, I use Float64, creating all my CuArrays with CUDA.zeros(Float64, ...). Do you think it is possible that the A100 converts them to FP32 and uses TF32 behind the scenes?
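
As a trivial sanity check (just a sketch, not from my simulation), the element type really is Float64 on the device:

using CUDA
A = CUDA.zeros(Float64, 4, 4)
@show eltype(A)  # Float64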

Thank you for your prompt reply

I am trying to produce an MWE showing the issue. The problem arises when I couple three different numerical methods (lattice Boltzmann (LBM), phase field (PF), and immersed boundaries (IBM)). IBM and PF modify a force CuArray, then LBM reads it. When I remove one of these methods everything works, but that is expected because nothing is moving then.

I am trying to simulate a simpler and more typical hydrodynamic problem, but the waiting list for access to the cluster is approximately one day. As the problem does not appear on my personal computer, I can't test it yet.

Thank you for your prompt reply

TF32 is only used by CUDA.jl if you start Julia with fast math enabled.
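
If you want to rule it out explicitly, you can force the strict math mode (a minimal sketch; the math_mode! API is available in recent CUDA.jl versions, check the docs for the one you use):

using CUDA
CUDA.math_mode!(CUDA.PEDANTIC_MATH)  # never allow TF32, even for Float32 CUBLAS calls

Note that TF32 only affects Float32 library routines such as CUBLAS; plain Float64 kernels like yours are never computed in TF32.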

I have access to an A100 for testing, so if you could create an MWE that shows the discrepancy, I can have a look.

I removed the whole PF part and modified the force term directly to get a simple Poiseuille flow. But the code is still a bit long to copy-paste here, I think:

#   Package 
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡
using CUDA
using Statistics
Ti = Int64
Tf = Float64
Ta = CuArray

WarpsT = 16
Bx = 8
Bz = 8
thrds2D = (WarpsT, WarpsT)
blcks2D = (Bx, Bz)

#   LBM
#   ≡≡≡≡≡
#   Parameters
#   ============
# Size
Nx = Ti(WarpsT * Bx)
Nz = Ti(WarpsT * Bz)
# Weights
w = Ta([4/9 1/9 1/9 1/9 1/9 1/36 1/36 1/36 1/36])
# Lattice Velocities
ξ = Ta([0 0; 1 0; 0 1; -1 0; 0 -1; 1 1; -1 1; -1 -1; 1 -1]);

#   Functions
#   ===========
#   Initialization of population
#   ––––––––––––––––––––––––––––––
function kernel_initialize_pop!(f, w)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    for k in eachindex(w)
        f[i,j,k] = w[k]
    end
    return nothing
end
@inline initialize!(f,w) = @cuda threads = thrds2D blocks = blcks2D kernel_initialize_pop!(f, w)

#   Collide and stream (periodic)
#   –––––––––––––––––––––––––––––––
function kernel_collide_stream!(fn, f, F, ρ_, v, ω, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    @inbounds begin
        
        # For streaming
        i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
        jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
        
        # Density and velocity
        ρ = ρ_[i,j]; vx = v[i,j,1]; vz = v[i,j,2]
        ρvx = ρ*vx; ρvz=ρ*vz
        # Source term factor
        ω_ = (1-0.5*ω)*1/3
        Fx=F[i,j,1]; Fz=F[i,j,2]
       
        # Post collision-Streaming populations
        ρvx2 = ρ*vx*vx; ρvz2 = ρ*vz*vz; ρvxvz = vx*vz*ρ; ω__ = 1-ω
        fn[i,j,1] = ω__*f[i,j,1] + ω*4/9*(ρ                   - 1.5*(ρvx2+ρvz2) )      + 4*ω_*(        -Fx*vx-Fz*vz) # ξ_0 = (0,0)
        fn[ip, j, 2] = ω__*f[i,j,2] + ω*1/9*(ρ + 3*ρvx         + 3*ρvx2 - 1.5*ρvz2 )      + ω_*(Fx      +2*vx*Fx-Fz*vz) # ξ_1 = (1,0)
        fn[i_, j, 4] = ω__*f[i,j,4] + ω*1/9*(ρ - 3*ρvx         + 3*ρvx2 - 1.5*ρvz2 )      + ω_*(-Fx     +2*vx*Fx-Fz*vz) # ξ_3 = (-1,0)
        fn[i, jp, 3] = ω__*f[i,j,3] + ω*1/9*(ρ + 3*ρvz         + 3*ρvz2 - 1.5*ρvx2 )      + ω_*(Fz      +2*vz*Fz-Fx*vx) # ξ_2 = (0,1)
        fn[i, jm, 5] = ω__*f[i,j,5] + ω*1/9*(ρ - 3*ρvz         + 3*ρvz2 - 1.5*ρvx2 )      + ω_*(-Fz     +2*vz*Fz-Fx*vx) # ξ_4 = (0,-1)
        fn[ip, jp, 6] = ω__*f[i,j,6] + ω*1/36*(ρ + 3*(ρvx+ρvz) + 3*(ρvx2+ρvz2) + 9*ρvxvz ) + 0.25*ω_*(Fx+Fz +2*(vx*Fx+vz*Fz) +3*(Fx*vz+vx*Fz) ) # ξ_5 = (1,1)
        fn[i_, jp, 7] = ω__*f[i,j,7] + ω*1/36*(ρ - 3*(ρvx-ρvz) + 3*(ρvx2+ρvz2) - 9*ρvxvz ) + 0.25*ω_*(-Fx+Fz +2*(vx*Fx+vz*Fz) -3*(Fx*vz+vx*Fz) ) # ξ_6 = (-1,1)
        fn[i_, jm, 8] = ω__*f[i,j,8] + ω*1/36*(ρ - 3*(ρvx+ρvz) + 3*(ρvx2+ρvz2) + 9*ρvxvz ) + 0.25*ω_*(-Fx-Fz +2*(vx*Fx+vz*Fz) +3*(Fx*vz+vx*Fz) ) # ξ_7 = (-1,-1)
        fn[ip, jm, 9] = ω__*f[i,j,9] + ω*1/36*(ρ + 3*(ρvx-ρvz) + 3*(ρvx2+ρvz2) - 9*ρvxvz ) + 0.25*ω_*(Fx-Fz +2*(vx*Fx+vz*Fz) -3*(Fx*vz+vx*Fz) ) # ξ_8 = (1,-1)
    end
    return nothing
end


#   Compute moments (ρ, v)
#   ––––––––––––––––––––––––
function kernel_comp_ρ_v!(ρ, v, f, F)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    @inbounds ρ[i, j] = f[i, j, 1] + f[i, j, 2] + f[i, j, 3] + f[i, j, 4] + f[i, j, 5] +
        f[i, j, 6] + f[i, j, 7] + f[i, j, 8] + f[i, j, 9]
    @inbounds v[i, j, 1] = (f[i, j, 2] - f[i, j, 4] + f[i, j, 6] - f[i, j, 7] -
        f[i, j, 8] + f[i, j, 9] + 0.5*F[i,j,1])/ρ[i,j]
    @inbounds v[i, j, 2] = (f[i, j, 3] - f[i, j, 5] + f[i, j, 6] + f[i, j, 7] -
        f[i, j, 8] - f[i, j, 9] + 0.5*F[i,j,2])/ρ[i,j]
    return nothing
end

@inline comp_ρ_v!(ρ, v, f, F) = @cuda threads = thrds2D blocks = blcks2D kernel_comp_ρ_v!(ρ, v, f, F)

#   LBM function
#   ––––––––––––––
function LBM_periodic!(f, f_t, ρ, v, F)
    @cuda threads = thrds2D blocks = blcks2D kernel_collide_stream!(f_t, f, F, ρ, v, ω, Nx, Nz)
    @. f = f_t
    comp_ρ_v!(ρ, v, f, F)
end

#   Immersed Boundary
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
#   Stencil
#   =========

@inline function stencil_4(dist)
    if dist<1.
        return 0.125*(3-2*dist+CUDA.sqrt(1+4*dist-4*dist*dist))
    elseif dist<2.
        return 0.125*(5-2*dist-CUDA.sqrt(-7+12*dist-4*dist*dist))
    else 
        return 0.0
    end
end

#   Interpolate the fluid velocity to the object nodes
#   ====================================================
function interpolate!(boundaries, v, stencil)
    for i in 1:length(boundaries)
        boundary = boundaries[i]
        n_nodes = size(boundary)[1]
        B = trunc(Ti, n_nodes/WarpsT)+1
        @cuda threads = WarpsT blocks = B kernel_interpolate!(boundary, n_nodes, v, Nx, Nz, stencil)
    end
end

function kernel_interpolate!(boundary, n_nodes, v, Nx, Nz, stencil)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    boundary[i,5] = 0
    boundary[i,6] = 0
    
    x_int = trunc(Int, boundary[i,1])   
    y_int = trunc(Int, boundary[i,2])  

    xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2
    
    for x=xs:xe
        for y=ys:ye
            xx = mod(x,1:Nx)
            yy = mod(y,1:Nz)
                       
            wx = stencil(abs(x+0.5 - boundary[i,1]))
            wy = stencil(abs(y+0.5 - boundary[i,2]))
            
            boundary[i,5] += v[xx,yy,1]*wx*wy
            boundary[i,6] += v[xx,yy,2]*wx*wy
        end
    end
    return nothing
end

#   update node positions
#   =======================
function update_particle_position!(boundaries)
    for boundary in boundaries
        n_nodes = size(boundary)[1]
        B = trunc(Ti, n_nodes/WarpsT)+1
        @cuda threads = WarpsT blocks = B kernel_update_particle_position!(boundary, n_nodes, Nx, Nz)
    end
end

function kernel_update_particle_position!(boundary, n_nodes, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    boundary[i,1] = boundary[i,1] + boundary[i,5]
    boundary[i,2] = boundary[i,2] + boundary[i,6]
    
    return nothing
end

#   compute node forces based on the object's deformation
#   =======================================================
function compute_particle_forces!(boundaries)
    for i in 1:length(boundaries)
        n_nodes = size(boundaries[i])[1]
        B = trunc(Ti, n_nodes/WarpsT)+1
        @cuda threads = WarpsT blocks = B kernel_comp_particle_forces!(boundaries[i], n_nodes, Nx)
    end
end


function kernel_comp_particle_forces!(boundary, n_nodes, Nx)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    # Compute area belonging to a node.
    # The area is used to weight the penalty force; this way the total force is independent of the number of IB nodes.
    area = Nx / n_nodes
    
    boundary[i,7] = -boundary[i,9] * (boundary[i,1] - boundary[i,3]) * area
    boundary[i,8] = -boundary[i,9] * (boundary[i,2] - boundary[i,4]) * area
    
    return nothing
end

#   spread forces to the fluid lattice
#   ====================================
function spread!(boundaries, F, stencil)
    for i in 1:length(boundaries)
        n_nodes = size(boundaries[i])[1]
        B = trunc(Ti, n_nodes/WarpsT)+1
        @cuda threads = WarpsT blocks = B kernel_spread!(boundaries[i], F, Nx, Nz, n_nodes, stencil)
    end
end

function kernel_spread!(boundary, F, Nx, Nz, n_nodes, stencil)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    x_int = trunc(Int, boundary[i,1])
    y_int = trunc(Int, boundary[i,2])

    xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2
    
    for x=xs:xe
        for y=ys:ye
            xx = mod(x,1:Nx)
            yy = mod(y,1:Nz)

            wx = stencil(abs(boundary[i,1] - 0.5 - x))
            wy = stencil(abs(boundary[i,2] - 0.5 - y))
            
            F[xx,yy,1] += boundary[i,7]*wx*wy
            F[xx,yy,2] += boundary[i,8]*wx*wy
        end
    end
    return nothing
end

#   IBM function
#   ==============
@inline function IBM!(boundaries, v, F, stencil)
    interpolate!(boundaries, v, stencil)
    update_particle_position!(boundaries)
    compute_particle_forces!(boundaries)
    spread!(boundaries, F, stencil)
end

#   Building object
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
function build_wall(n_nodes, begx, begz, endx, endz, stiff)
    wall = CUDA.zeros(Tf, n_nodes, 9)
    B = trunc(Ti, n_nodes/WarpsT)+1
    @cuda threads = WarpsT blocks = B kernel_wall!(wall, n_nodes, begx, begz, endx, endz, stiff, Nx, Nz)
    return wall
end

function kernel_wall!(wall, n_nodes, begx, begz, endx, endz, stiff, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    posx = begx + (i-1)*(endx-begx)/n_nodes
    posx = mod(posx, Nx)
    posz = begz + (i-1)*(endz-begz)/n_nodes
    posz = mod(posz, Nz) 
    wall[i,1] = posx
    wall[i,3] = posx
    wall[i,2] = posz
    wall[i,4] = posz
    wall[i,9] = stiff
    return nothing
end

function build_circle(n_nodes, posx, posz, radius, stiff)
    circle = CUDA.zeros(Tf, n_nodes, 9)
    B = trunc(Ti, n_nodes/WarpsT)+1
    @cuda threads = WarpsT blocks = B kernel_circle!(circle, n_nodes, posx, posz, radius, stiff, Nx, Nz)
    return circle
end

function kernel_circle!(circle, n_nodes, posx, posz, radius, stiff, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    posx = posx + cos(2*π*(i-1)/n_nodes)*radius
    posx = mod(posx, Nx)
    posz = posz + sin(2*π*(i-1)/n_nodes)*radius
    posz = mod(posz, Nz)
    circle[i,1] = posx
    circle[i,3] = posx
    circle[i,2] = posz
    circle[i,4] = posz
    circle[i,9]= stiff
    return nothing
end

#   Poiseuille
#   ≡≡≡≡≡≡≡≡≡≡≡≡
#   Parameters
#   ============
WarpsT = 16
Bx = 8
Bz = 8+2
thrds2D = (WarpsT, WarpsT)
blcks2D = (Bx, Bz)

Nx = Ti(WarpsT * Bx)
Nz = Ti(WarpsT * Bz)

# Wall properties
wall_num_nodes = Nx*10
wall_width = Nz-2*WarpsT #(Nz-2)*0.9
wall_stiffness = 1.0 # IBM spring constant

NΔt = 50000

τ=sqrt(3/16)+0.5      # relaxation time (BGK model) 
ω=1/τ

vx_max=0.001            # maximum velocity
ν=(2*τ-1)/6           # kinematic shear viscosity

∇P = 8*ν*vx_max/wall_width^2  # Pressure gradient

Re_fact = wall_width/ν

#   Initialization
#   ================
f = CUDA.zeros(Tf, Nx, Nz, 9)
f_t = CUDA.zeros(Tf, Nx, Nz, 9)
ρ = CUDA.ones(Tf, Nx, Nz)
v = CUDA.zeros(Tf, Nx, Nz, 2)
F = CUDA.zeros(Tf, Nx, Nz, 2)

boundaries = Vector{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}(undef,2)
boundaries_cpu = Vector{Array{Float64, 2}}(undef,2)

function initialize!(f,w, boundaries)
    initialize!(f,w)
    begx = 0.0; begz = (Nz-wall_width)*0.5
    endx = Nx-0.5; endz = begz
    boundaries[1] = build_wall(wall_num_nodes, begx, begz, endx, endz, wall_stiffness)
    begx = 0.0; begz = (Nz+wall_width)*0.5
    endx = Nx-0.5; endz = begz
    boundaries[2] = build_wall(wall_num_nodes, begx, begz, endx, endz, wall_stiffness)
    return nothing
end

#   Time loop
#   ===========
initialize!(f, w, boundaries)

for t=0:NΔt
    @. F[:,:,1] = ∇P
    @. F[:,:,2] = 0.0
    IBM!(boundaries, v, F, stencil_4)
    LBM_periodic!(f, f_t, ρ, v, F)
end

#   Test Error
#   ===========
for boundary in 1:length(boundaries)
    boundaries_cpu[boundary] = Array(boundaries[boundary])
end

err = 0
for boundary in boundaries_cpu
    global err += std(boundary[:,2])  # `global` so this also works when run as a script
end
println("Error: ",err)

I have to warn you that this code was written by a beginner. There are surely plenty of mistakes concerning optimization and the “right way to write” things. I am open to feedback.

Thank you for your help.

EDIT: Now it directly returns the error and saves nothing; on my K620 the error is on the order of 1e-6, on the A100 on the order of 1e-3.

Can you simplify it to something that demonstrates wrong values being computed? For example, the expected value or maximum tolerated error of a single iteration, or so?

I simplified and edited it this morning; now it returns one number at the end. I also reduced the number of time-steps to make it really quick (1 min max).

Thank you

That’s still a lot of code, I won’t be able to go through that. Also, I’m seeing massive errors (> 1e0) on every system I tested this on. If it could be of any help, I could give you shell access to a system with such a GPU, so that you can more easily work on further reducing your code.

Thank you, that is always helpful.

I was trying to reduce the code and I think I may have introduced a mistake (but I don't know).

Unfortunately I cannot try to correct it yet: the cluster is under maintenance until Friday evening…
I will try during the weekend.
As my problem is not urgent, I can wait until then. If it still does not work, I might take you up on shell access to an A100 in order to reduce my code and find the mistake.

Hello, I am back from vacation and the cluster has been up since yesterday.

I have this GPU code (I cannot reduce it further):

#   Package 
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡
using CUDA
using Statistics
Ti = Int64
Tf = Float64
Ta = CuArray

WarpsT = 16
Bx = 8
Bz = 8
thrds2D = (WarpsT, WarpsT)
blcks2D = (Bx, Bz)
# Size
Nx = Ti(WarpsT * Bx)
Nz = Ti(WarpsT * Bz)

# Weights
w = Ta([4/9 1/9 1/9 1/9 1/9 1/36 1/36 1/36 1/36])

#   Initialization of population
#   ––––––––––––––––––––––––––––––
function kernel_initialize_pop!(f, w)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    for k in eachindex(w)
        f[i,j,k] = w[k]
    end
    return nothing
end

#   Collide and stream (periodic)
#   –––––––––––––––––––––––––––––––
function kernel_collide_stream!(fn, f, F, ρ_, v, ω, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    @inbounds begin
        
        # For streaming
        i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
        jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)
        
        # Density and velocity
        ρ = ρ_[i,j]; vx = v[i,j,1]; vz = v[i,j,2]
        ρvx = ρ*vx; ρvz=ρ*vz
        # Source term factor
        ω_ = (1-0.5*ω)*1/3
        Fx=F[i,j,1]; Fz=F[i,j,2]
       
        # Post collision-Streaming populations
        ρvx2 = ρ*vx*vx; ρvz2 = ρ*vz*vz; ρvxvz = vx*vz*ρ; ω__ = 1-ω
        fn[i,j,1] = ω__*f[i,j,1] + ω*4/9*(ρ                   - 1.5*(ρvx2+ρvz2) )      + 4*ω_*(        -Fx*vx-Fz*vz) # ξ_0 = (0,0)
        fn[ip, j, 2] = ω__*f[i,j,2] + ω*1/9*(ρ + 3*ρvx         + 3*ρvx2 - 1.5*ρvz2 )      + ω_*(Fx      +2*vx*Fx-Fz*vz) # ξ_1 = (1,0)
        fn[i_, j, 4] = ω__*f[i,j,4] + ω*1/9*(ρ - 3*ρvx         + 3*ρvx2 - 1.5*ρvz2 )      + ω_*(-Fx     +2*vx*Fx-Fz*vz) # ξ_3 = (-1,0)
        fn[i, jp, 3] = ω__*f[i,j,3] + ω*1/9*(ρ + 3*ρvz         + 3*ρvz2 - 1.5*ρvx2 )      + ω_*(Fz      +2*vz*Fz-Fx*vx) # ξ_2 = (0,1)
        fn[i, jm, 5] = ω__*f[i,j,5] + ω*1/9*(ρ - 3*ρvz         + 3*ρvz2 - 1.5*ρvx2 )      + ω_*(-Fz     +2*vz*Fz-Fx*vx) # ξ_4 = (0,-1)
        fn[ip, jp, 6] = ω__*f[i,j,6] + ω*1/36*(ρ + 3*(ρvx+ρvz) + 3*(ρvx2+ρvz2) + 9*ρvxvz ) + 0.25*ω_*(Fx+Fz +2*(vx*Fx+vz*Fz) +3*(Fx*vz+vx*Fz) ) # ξ_5 = (1,1)
        fn[i_, jp, 7] = ω__*f[i,j,7] + ω*1/36*(ρ - 3*(ρvx-ρvz) + 3*(ρvx2+ρvz2) - 9*ρvxvz ) + 0.25*ω_*(-Fx+Fz +2*(vx*Fx+vz*Fz) -3*(Fx*vz+vx*Fz) ) # ξ_6 = (-1,1)
        fn[i_, jm, 8] = ω__*f[i,j,8] + ω*1/36*(ρ - 3*(ρvx+ρvz) + 3*(ρvx2+ρvz2) + 9*ρvxvz ) + 0.25*ω_*(-Fx-Fz +2*(vx*Fx+vz*Fz) +3*(Fx*vz+vx*Fz) ) # ξ_7 = (-1,-1)
        fn[ip, jm, 9] = ω__*f[i,j,9] + ω*1/36*(ρ + 3*(ρvx-ρvz) + 3*(ρvx2+ρvz2) - 9*ρvxvz ) + 0.25*ω_*(Fx-Fz +2*(vx*Fx+vz*Fz) -3*(Fx*vz+vx*Fz) ) # ξ_8 = (1,-1)
    end
    return nothing
end


#   Compute moments (ρ, v)
#   ––––––––––––––––––––––––
function kernel_comp_ρ_v!(ρ, v, f, F)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    @inbounds ρ[i, j] = f[i, j, 1] + f[i, j, 2] + f[i, j, 3] + f[i, j, 4] + f[i, j, 5] +
        f[i, j, 6] + f[i, j, 7] + f[i, j, 8] + f[i, j, 9]
    @inbounds v[i, j, 1] = (f[i, j, 2] - f[i, j, 4] + f[i, j, 6] - f[i, j, 7] -
        f[i, j, 8] + f[i, j, 9] + 0.5*F[i,j,1])/ρ[i,j]
    @inbounds v[i, j, 2] = (f[i, j, 3] - f[i, j, 5] + f[i, j, 6] + f[i, j, 7] -
        f[i, j, 8] - f[i, j, 9] + 0.5*F[i,j,2])/ρ[i,j]
    return nothing
end



#   Stencil
#   =========

@inline function stencil_4(dist)
    if dist<1.
        return Float64(0.125*(3-2*dist+CUDA.sqrt(1+4*dist-4*dist*dist)))
    elseif dist<2.
        return Float64(0.125*(5-2*dist-CUDA.sqrt(-7+12*dist-4*dist*dist)))
    else 
        return Float64(0.0)
    end
end

#   Interpolate the fluid velocity to the object nodes
#   ====================================================
function kernel_interpolate!(boundary, n_nodes, v, Nx, Nz, stencil)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    boundary[i,5] = 0
    boundary[i,6] = 0
    
    x_int = trunc(Int, boundary[i,1])   
    y_int = trunc(Int, boundary[i,2])  

    xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2
    
    for x=xs:xe
        for y=ys:ye
            xx = mod(x,1:Nx)
            yy = mod(y,1:Nz)
                       
            wx = stencil(Float64(abs(x+0.5 - boundary[i,1])))
            wy = stencil(Float64(abs(y+0.5 - boundary[i,2])))
            
            boundary[i,5] += v[xx,yy,1]*wx*wy
            boundary[i,6] += v[xx,yy,2]*wx*wy
        end
    end
    return nothing
end

#   update node positions
#   =======================
function kernel_update_particle_position!(boundary, n_nodes, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    boundary[i,1] = boundary[i,1] + boundary[i,5]
    boundary[i,2] = boundary[i,2] + boundary[i,6]
    
    return nothing
end

#   compute node forces based on the object's deformation
#   =======================================================
function kernel_comp_particle_forces!(boundary, n_nodes, Nx)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end

    area = Nx / n_nodes
    
    boundary[i,7] = -boundary[i,9] * (boundary[i,1] - boundary[i,3]) * area
    boundary[i,8] = -boundary[i,9] * (boundary[i,2] - boundary[i,4]) * area
    
    return nothing
end

#   spread forces to the fluid lattice
#   ====================================
function kernel_spread!(boundary, F, Nx, Nz, n_nodes, stencil)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    x_int = trunc(Int, boundary[i,1])
    y_int = trunc(Int, boundary[i,2])

    xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2
    
    for x=xs:xe
        for y=ys:ye
            xx = mod(x,1:Nx)
            yy = mod(y,1:Nz)

            wx = stencil(Float64(abs(boundary[i,1] - 0.5 - x)))
            wy = stencil(Float64(abs(boundary[i,2] - 0.5 - y)))
            
            F[xx,yy,1] += boundary[i,7]*wx*wy
            F[xx,yy,2] += boundary[i,8]*wx*wy
        end
    end
    return nothing
end


#   Building object
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
function kernel_wall!(wall, n_nodes, begx, begz, endx, endz, stiff, Nx, Nz)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    posx = begx + (i-1)*(endx-begx)/n_nodes
    posx = mod(posx, Nx)
    posz = begz + (i-1)*(endz-begz)/n_nodes
    posz = mod(posz, Nz) 
    wall[i,1] = posx
    wall[i,3] = posx
    wall[i,2] = posz
    wall[i,4] = posz
    wall[i,9] = stiff
    return nothing
end


#   Poiseuille
#   ≡≡≡≡≡≡≡≡≡≡≡≡
#   Parameters
#   ============
wall_num_nodes = Nx*10
wall_stiffness = 1.0 # IBM spring constant
B = trunc(Ti, wall_num_nodes/WarpsT)+1

NΔt = 10000

τ=sqrt(3/16)+0.5      # relaxation time (BGK model) 
ω=1/τ

vx_max=0.001            # maximum velocity
ν=(2*τ-1)/6           # kinematic shear viscosity

∇P = Tf(8*ν*vx_max/(Nz^2))  # Pressure gradient

#   Initialization
#   ================
f = CUDA.zeros(Tf, Nx, Nz, 9)
f_t = CUDA.zeros(Tf, Nx, Nz, 9)
ρ = CUDA.ones(Tf, Nx, Nz)
v = CUDA.zeros(Tf, Nx, Nz, 2)
F = CUDA.zeros(Tf, Nx, Nz, 2)

boundary = CUDA.zeros(Tf, wall_num_nodes, 9)
boundary_cpu = zeros(Tf, wall_num_nodes, 9)

#   Time loop
#   ===========
@cuda threads = thrds2D blocks = blcks2D kernel_initialize_pop!(f, w)
@cuda threads = WarpsT blocks = B kernel_wall!(boundary, wall_num_nodes, 0.0, Nz*0.5, Nx-0.5, Nz*0.5, wall_stiffness, Nx, Nz)
for t=0:NΔt
    @. F[:,:,1] = ∇P
    @. F[:,:,2] = 0.0
    @cuda threads = WarpsT blocks = B kernel_interpolate!(boundary, wall_num_nodes, v, Nx, Nz, stencil_4)
    @cuda threads = WarpsT blocks = B kernel_update_particle_position!(boundary, wall_num_nodes, Nx, Nz)
    @cuda threads = WarpsT blocks = B kernel_comp_particle_forces!(boundary, wall_num_nodes, Nx)
    @cuda threads = WarpsT blocks = B kernel_spread!(boundary, F, Nx, Nz, wall_num_nodes, stencil_4)
    @cuda threads = thrds2D blocks = blcks2D kernel_collide_stream!(f_t, f, F, ρ, v, ω, Nx, Nz)
    @. f = f_t
    @cuda threads = thrds2D blocks = blcks2D kernel_comp_ρ_v!(ρ, v, f, F)
end

boundary_cpu = Array(boundary);  # copy to the host

err = std(boundary_cpu[:,2])
println("Error: ", err)

It corresponds to this CPU code (I just replaced the GPU indexing with for loops):

#   Package 
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡
using Statistics
Ti = Int64
Tf = Float64
Ta = Array

# Size
Nx = Ti(128)
Nz = Ti(128)
# Weights
w = Ta([4/9 1/9 1/9 1/9 1/9 1/36 1/36 1/36 1/36])

#   Initialization of population
#   ––––––––––––––––––––––––––––––
function cpu_initialize_pop!(f, w)
    for i in 1:Nx, j in 1:Nz, k in eachindex(w)
        f[i,j,k] = w[k]
    end
    return nothing
end

#   Collide and stream (periodic)
#   –––––––––––––––––––––––––––––––
function cpu_collide_stream!(fn, f, F, ρ_, v, ω, Nx, Nz)
    for i in 1:Nx, j in 1:Nz
        @inbounds begin

            # For streaming
            i_ = mod(i-1,1:Nx); ip = mod(i+1,1:Nx)
            jm = mod(j-1,1:Nz); jp = mod(j+1,1:Nz)

            # Density and velocity
            ρ = ρ_[i,j]; vx = v[i,j,1]; vz = v[i,j,2]
            ρvx = ρ*vx; ρvz=ρ*vz
            # Source term factor
            ω_ = (1-0.5*ω)*1/3
            Fx=F[i,j,1]; Fz=F[i,j,2]

            # Post collision-Streaming populations
            ρvx2 = ρ*vx*vx; ρvz2 = ρ*vz*vz; ρvxvz = vx*vz*ρ; ω__ = 1-ω
            fn[i,j,1] = ω__*f[i,j,1] + ω*4/9*(ρ                   - 1.5*(ρvx2+ρvz2) )      + 4*ω_*(        -Fx*vx-Fz*vz) # ξ_0 = (0,0)
            fn[ip, j, 2] = ω__*f[i,j,2] + ω*1/9*(ρ + 3*ρvx         + 3*ρvx2 - 1.5*ρvz2 )      + ω_*(Fx      +2*vx*Fx-Fz*vz) # ξ_1 = (1,0)
            fn[i_, j, 4] = ω__*f[i,j,4] + ω*1/9*(ρ - 3*ρvx         + 3*ρvx2 - 1.5*ρvz2 )      + ω_*(-Fx     +2*vx*Fx-Fz*vz) # ξ_3 = (-1,0)
            fn[i, jp, 3] = ω__*f[i,j,3] + ω*1/9*(ρ + 3*ρvz         + 3*ρvz2 - 1.5*ρvx2 )      + ω_*(Fz      +2*vz*Fz-Fx*vx) # ξ_2 = (0,1)
            fn[i, jm, 5] = ω__*f[i,j,5] + ω*1/9*(ρ - 3*ρvz         + 3*ρvz2 - 1.5*ρvx2 )      + ω_*(-Fz     +2*vz*Fz-Fx*vx) # ξ_4 = (0,-1)
            fn[ip, jp, 6] = ω__*f[i,j,6] + ω*1/36*(ρ + 3*(ρvx+ρvz) + 3*(ρvx2+ρvz2) + 9*ρvxvz ) + 0.25*ω_*(Fx+Fz +2*(vx*Fx+vz*Fz) +3*(Fx*vz+vx*Fz) ) # ξ_5 = (1,1)
            fn[i_, jp, 7] = ω__*f[i,j,7] + ω*1/36*(ρ - 3*(ρvx-ρvz) + 3*(ρvx2+ρvz2) - 9*ρvxvz ) + 0.25*ω_*(-Fx+Fz +2*(vx*Fx+vz*Fz) -3*(Fx*vz+vx*Fz) ) # ξ_6 = (-1,1)
            fn[i_, jm, 8] = ω__*f[i,j,8] + ω*1/36*(ρ - 3*(ρvx+ρvz) + 3*(ρvx2+ρvz2) + 9*ρvxvz ) + 0.25*ω_*(-Fx-Fz +2*(vx*Fx+vz*Fz) +3*(Fx*vz+vx*Fz) ) # ξ_7 = (-1,-1)
            fn[ip, jm, 9] = ω__*f[i,j,9] + ω*1/36*(ρ + 3*(ρvx-ρvz) + 3*(ρvx2+ρvz2) - 9*ρvxvz ) + 0.25*ω_*(Fx-Fz +2*(vx*Fx+vz*Fz) -3*(Fx*vz+vx*Fz) ) # ξ_8 = (1,-1)
        end
    end
    return nothing
end


#   Compute moments (ρ, v)
#   ––––––––––––––––––––––––
function cpu_comp_ρ_v!(ρ, v, f, F)
    for i in 1:Nx, j in 1:Nz
        @inbounds ρ[i, j] = f[i, j, 1] + f[i, j, 2] + f[i, j, 3] + f[i, j, 4] + f[i, j, 5] +
            f[i, j, 6] + f[i, j, 7] + f[i, j, 8] + f[i, j, 9]
        @inbounds v[i, j, 1] = (f[i, j, 2] - f[i, j, 4] + f[i, j, 6] - f[i, j, 7] -
            f[i, j, 8] + f[i, j, 9] + 0.5*F[i,j,1])/ρ[i,j]
        @inbounds v[i, j, 2] = (f[i, j, 3] - f[i, j, 5] + f[i, j, 6] + f[i, j, 7] -
            f[i, j, 8] - f[i, j, 9] + 0.5*F[i,j,2])/ρ[i,j]
    end
    return nothing
end



#   Stencil
#   =========

@inline function stencil_4(dist)
    if dist<1.
        return 0.125*(3-2*dist+sqrt(1+4*dist-4*dist*dist))
    elseif dist<2.
        return 0.125*(5-2*dist-sqrt(-7+12*dist-4*dist*dist))
    else 
        return 0.0
    end
end

#   Interpolate the fluid velocity to the object nodes
#   ====================================================
function cpu_interpolate!(boundary, n_nodes, v, Nx, Nz, stencil)
    for i = 1:n_nodes 
    
        boundary[i,5] = 0
        boundary[i,6] = 0

        x_int = trunc(Int, boundary[i,1])   
        y_int = trunc(Int, boundary[i,2])  

        xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2

        for x=xs:xe
            for y=ys:ye
                xx = mod(x,1:Nx)
                yy = mod(y,1:Nz)

                wx = stencil(abs(x+0.5 - boundary[i,1]))
                wy = stencil(abs(y+0.5 - boundary[i,2]))

                boundary[i,5] += v[xx,yy,1]*wx*wy
                boundary[i,6] += v[xx,yy,2]*wx*wy
            end
        end
    end
    return nothing
end

#   update node positions
#   =======================
function cpu_update_particle_position!(boundary, n_nodes, Nx, Nz)
    for i = 1:n_nodes         
        boundary[i,1] = boundary[i,1] + boundary[i,5]
        boundary[i,2] = boundary[i,2] + boundary[i,6]
    end
    return nothing
end

#   compute node forces based on the object's deformation
#   =======================================================
function cpu_comp_particle_forces!(boundary, n_nodes, Nx)
    for i = 1:n_nodes    
        area = Nx / n_nodes

        boundary[i,7] = -boundary[i,9] * (boundary[i,1] - boundary[i,3]) * area
        boundary[i,8] = -boundary[i,9] * (boundary[i,2] - boundary[i,4]) * area
    end
    return nothing
end

#   spread forces to the fluid lattice
#   ====================================
function cpu_spread!(boundary, F, Nx, Nz, n_nodes, stencil)
    for i = 1:n_nodes
        
        x_int = trunc(Int, boundary[i,1])
        y_int = trunc(Int, boundary[i,2])

        xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2

        for x=xs:xe
            for y=ys:ye
                xx = mod(x,1:Nx)
                yy = mod(y,1:Nz)

                wx = stencil(abs(boundary[i,1] - 0.5 - x))
                wy = stencil(abs(boundary[i,2] - 0.5 - y))

                F[xx,yy,1] += boundary[i,7]*wx*wy
                F[xx,yy,2] += boundary[i,8]*wx*wy
            end
        end
    end
    return nothing
end


#   Building object
#   ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
function cpu_wall!(wall, n_nodes, begx, begz, endx, endz, stiff, Nx, Nz)
    for i = 1:n_nodes
        posx = begx + (i-1)*(endx-begx)/n_nodes
        posx = mod(posx, Nx)
        posz = begz + (i-1)*(endz-begz)/n_nodes
        posz = mod(posz, Nz) 
        wall[i,1] = posx
        wall[i,3] = posx
        wall[i,2] = posz
        wall[i,4] = posz
        wall[i,9] = stiff
    end
    return nothing
end


#   Poiseuille
#   ≡≡≡≡≡≡≡≡≡≡≡≡
#   Parameters
#   ============

wall_num_nodes = Nx*10
wall_stiffness = 1.0 # IBM spring constant

NΔt = 10000

τ=sqrt(3/16)+0.5      # relaxation time (BGK model) 
ω=1/τ

vx_max=0.001            # maximum velocity
ν=(2*τ-1)/6           # kinematic shear viscosity

∇P = Tf(8*ν*vx_max/(Nz^2))  # Pressure gradient

#   Initialization
#   ================
f = zeros(Tf, Nx, Nz, 9)
f_t = zeros(Tf, Nx, Nz, 9)
ρ = ones(Tf, Nx, Nz)
v = zeros(Tf, Nx, Nz, 2)
F = zeros(Tf, Nx, Nz, 2)

boundary = zeros(Tf, wall_num_nodes, 9)

#   Time loop
#   ===========
cpu_initialize_pop!(f, w)
cpu_wall!(boundary, wall_num_nodes, 0.0, Nz*0.5, Nx-0.5, Nz*0.5, wall_stiffness, Nx, Nz)
for t=0:NΔt
    if t%100==0
        print(t,", ")
    end
    @. F[:,:,1] = ∇P
    @. F[:,:,2] = 0.0
    cpu_interpolate!(boundary, wall_num_nodes, v, Nx, Nz, stencil_4)
    cpu_update_particle_position!(boundary, wall_num_nodes, Nx, Nz)
    cpu_comp_particle_forces!(boundary, wall_num_nodes, Nx)
    cpu_spread!(boundary, F, Nx, Nz, wall_num_nodes, stencil_4)
    cpu_collide_stream!(f_t, f, F, ρ, v, ω, Nx, Nz)
    @. f = f_t
    cpu_comp_ρ_v!(ρ, v, f, F)
end

err = std(boundary[:,2])
println("Error: ", err)

I obtain an error of 0.0 on the CPU, of order 1e-6 on the K620, and 1e-3 on the A100…
I have tried many things; my next step will be to write the same code using CUDA C or PyCUDA to see if I get the same problem.
I have absolutely no idea where the mistake comes from, and my ignorance of GPU computing is probably the main issue.
Thank you for your help.
If you see something wrong, it would be incredibly nice to hear about it.

I tried on a P100; the error is about 1e-5.
To sum up:

  • CPU: err = 0.0
  • K620: err ~ 1e-6
  • P100: err ~ 1e-5
  • A100: err ~ 1e-3

Although there is nothing stochastic in my simulation, the error is random…
That makes things even weirder.

Hello,
Using a mixed CPU/GPU code, I identified the problem:

function kernel_spread!(boundary, F, Nx, Nz, n_nodes, stencil)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if i > n_nodes
        return nothing
    end
    
    x_int = trunc(Int, boundary[i,1])
    y_int = trunc(Int, boundary[i,2])

    xs = x_int-1; xe = x_int+2; ys = y_int-1; ye = y_int+2
    
    for x=xs:xe
        for y=ys:ye
            xx = mod(x,1:Nx)
            yy = mod(y,1:Nz)

            wx = stencil(Float64(abs(boundary[i,1] - 0.5 - x)))
            wy = stencil(Float64(abs(boundary[i,2] - 0.5 - y)))
            
            F[xx,yy,1] += boundary[i,7]*wx*wy #### HERE !
            F[xx,yy,2] += boundary[i,8]*wx*wy #### HERE !
        end
    end
    return nothing
end

Some threads write to the same cell of the CuArray (for example F[1,1,1]).

This happens because each thread of index i adds a float to the array at indices {i-2, …, i+2}, so the more threads a graphics card can run in parallel, the more conflicts there are between threads.

Is there a way to resolve this conflict?

Yeah, don’t do that 🙂 A quick fix is to add CUDA.@atomic in front of those additions. That will make it slower, though. A better solution is to compute interim sums, aggregate those across the block, and then perform atomic additions at the grid level; but that’s of course much more invasive. Depending on the exact characteristics, CUDA.@atomic might perform well enough for you.
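
Applied to the loop body in kernel_spread! above, the quick fix would look like this (only the two accumulation lines change):

            # atomics serialize concurrent updates to the same lattice cell,
            # so the result no longer depends on how many threads run in parallel
            CUDA.@atomic F[xx,yy,1] += boundary[i,7]*wx*wy
            CUDA.@atomic F[xx,yy,2] += boundary[i,8]*wx*wy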

Thank you so much!
Everything is working well now!