Hi everyone,

I am rather new to GPU programming and I am having difficulty understanding why I get different results when running a function on GPU vs. CPU. The function involves atomic operation, and even running it on GPU several times, the solution is not exactly the same. The difference in the results (between CPU and GPU or between multiple runs on GPU) is very small (within the last decimal digits). This discrepancy may seem very small, however the calculated density is the input to another function and these functions are all iterating within a loop which lead to divergence of discrepancy after a while.

So my question is if this discrepancy is because of the inherent precision of GPU calculation or I am using the atomic macro incorrectly. I defined all variables as `Float64`

both on GPU and CPU.

The following code includes CPU and GPU functions (`denCalc_CPU`

& `denCalc_GPU`

) which calculate the density on a 1D grid by depositing the weight of â€śmacroparticlesâ€ť to the grid using linear interpolation implemented in the `scatter_CPU`

or `scatter_GPU`

functions.

Also, I was wondering if there are any difference between using atomic operation as ` CUDA.@atomic`

vs. `CUDA.atomic_add!`

which are included in `scatter_GPU`

function?

Many thanks

```
using CUDA
using Random
x_min = 0.0
x_max = 0.067
num_nodes = 128
num_mp = Int(1e7)
densiy = 2e14
Î”h = (x_max - x_min) / (num_nodes - 1)
node_vol = Î”h * ones(Float64, num_nodes)
mpwt = (x_max - x_min) * densiy / num_mp
node_vol[1] *= 0.5
node_vol[end] *= 0.5
den = zeros(Float64, num_nodes)
pos = zeros(Float64, num_mp)
Random.seed!(1234)
for i in 1:num_mp
pos[i] = x = x_min[1] + rand() * (x_max[1] - x_min[1])
end
den_dev = CuArray(den)
node_vol_dev = CuArray(node_vol)
pos_dev = CuArray(pos)
function pos_to_logical(x, x_min, Î”h)
lc = (x - x_min) / Î”h
cellIndex = floor(Int, lc) + 1
posInCell = lc - Float64(cellIndex) + 1.0
return cellIndex, posInCell
end
function scatter_GPU(field, var, i, di)
CUDA.@atomic field[i] += var * (1.0 - di)
CUDA.@atomic field[i+1] += var * (di)
# CUDA.atomic_add!(pointer(field, i), var * (1.0 - di))
# CUDA.atomic_add!(pointer(field, i + 1), var * (di))
end
function scatter_CPU(field, var, i, di)
field[i] += var * (1.0 - di)
field[i+1] += var * (di)
end
function denCalc_kernel(den, pos, mpwt, x_min, Î”h)
thread = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
for i = thread:stride:length(pos)
if i <= length(pos)
@inbounds begin
cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Î”h)
scatter_GPU(den, mpwt, cellIndex_x, posInCell_x)
end
end
end
return
end
function denCalc_GPU(den, pos, mpwt, node_vol, x_min, Î”h)
den .= 0.0
num_threads = 512
num_blocks = ceil(Int, length(pos) / num_threads)
CUDA.@sync begin
@cuda threads = num_threads blocks = num_blocks denCalc_kernel(den, pos, mpwt, x_min, Î”h)
end
den ./= node_vol
end
function denCalc_CPU(den, pos, mpwt, node_vol, x_min, Î”h)
den .= 0.0
@inbounds for i in eachindex(pos)
cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Î”h)
scatter_CPU(den, mpwt[i], cellIndex_x, posInCell_x)
end
den ./= node_vol
end
denCalc_GPU(den_dev, pos_dev, mpwt, node_vol_dev, x_min, Î”h)
denCalc_CPU(den, pos, mpwt, node_vol, x_min, Î”h)
```