Hi everyone,
I am rather new to GPU programming and I am having difficulty understanding why I get different results when running a function on GPU vs. CPU. The function involves atomic operation, and even running it on GPU several times, the solution is not exactly the same. The difference in the results (between CPU and GPU or between multiple runs on GPU) is very small (within the last decimal digits). This discrepancy may seem very small, however the calculated density is the input to another function and these functions are all iterating within a loop which lead to divergence of discrepancy after a while.
So my question is if this discrepancy is because of the inherent precision of GPU calculation or I am using the atomic macro incorrectly. I defined all variables as Float64
both on GPU and CPU.
The following code includes CPU and GPU functions (denCalc_CPU
& denCalc_GPU
) which calculate the density on a 1D grid by depositing the weight of “macroparticles” to the grid using linear interpolation implemented in the scatter_CPU
or scatter_GPU
functions.
Also, I was wondering if there are any difference between using atomic operation as CUDA.@atomic
vs. CUDA.atomic_add!
which are included in scatter_GPU
function?
Many thanks
using CUDA
using Random
x_min = 0.0
x_max = 0.067
num_nodes = 128
num_mp = Int(1e7)
densiy = 2e14
Δh = (x_max - x_min) / (num_nodes - 1)
node_vol = Δh * ones(Float64, num_nodes)
mpwt = (x_max - x_min) * densiy / num_mp
node_vol[1] *= 0.5
node_vol[end] *= 0.5
den = zeros(Float64, num_nodes)
pos = zeros(Float64, num_mp)
Random.seed!(1234)
for i in 1:num_mp
pos[i] = x = x_min[1] + rand() * (x_max[1] - x_min[1])
end
den_dev = CuArray(den)
node_vol_dev = CuArray(node_vol)
pos_dev = CuArray(pos)
function pos_to_logical(x, x_min, Δh)
lc = (x - x_min) / Δh
cellIndex = floor(Int, lc) + 1
posInCell = lc - Float64(cellIndex) + 1.0
return cellIndex, posInCell
end
function scatter_GPU(field, var, i, di)
CUDA.@atomic field[i] += var * (1.0 - di)
CUDA.@atomic field[i+1] += var * (di)
# CUDA.atomic_add!(pointer(field, i), var * (1.0 - di))
# CUDA.atomic_add!(pointer(field, i + 1), var * (di))
end
function scatter_CPU(field, var, i, di)
field[i] += var * (1.0 - di)
field[i+1] += var * (di)
end
function denCalc_kernel(den, pos, mpwt, x_min, Δh)
thread = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
for i = thread:stride:length(pos)
if i <= length(pos)
@inbounds begin
cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Δh)
scatter_GPU(den, mpwt, cellIndex_x, posInCell_x)
end
end
end
return
end
function denCalc_GPU(den, pos, mpwt, node_vol, x_min, Δh)
den .= 0.0
num_threads = 512
num_blocks = ceil(Int, length(pos) / num_threads)
CUDA.@sync begin
@cuda threads = num_threads blocks = num_blocks denCalc_kernel(den, pos, mpwt, x_min, Δh)
end
den ./= node_vol
end
function denCalc_CPU(den, pos, mpwt, node_vol, x_min, Δh)
den .= 0.0
@inbounds for i in eachindex(pos)
cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Δh)
scatter_CPU(den, mpwt[i], cellIndex_x, posInCell_x)
end
den ./= node_vol
end
denCalc_GPU(den_dev, pos_dev, mpwt, node_vol_dev, x_min, Δh)
denCalc_CPU(den, pos, mpwt, node_vol, x_min, Δh)