# Different calculation results when using CPU vs. GPU with CUDA.jl

Hi everyone,
I am rather new to GPU programming and I am having difficulty understanding why I get different results when running a function on GPU vs. CPU. The function involves atomic operation, and even running it on GPU several times, the solution is not exactly the same. The difference in the results (between CPU and GPU or between multiple runs on GPU) is very small (within the last decimal digits). This discrepancy may seem very small, however the calculated density is the input to another function and these functions are all iterating within a loop which lead to divergence of discrepancy after a while.
So my question is if this discrepancy is because of the inherent precision of GPU calculation or I am using the atomic macro incorrectly. I defined all variables as `Float64` both on GPU and CPU.

The following code includes CPU and GPU functions (`denCalc_CPU` & `denCalc_GPU`) which calculate the density on a 1D grid by depositing the weight of â€śmacroparticlesâ€ť to the grid using linear interpolation implemented in the `scatter_CPU` or `scatter_GPU` functions.

Also, I was wondering if there are any difference between using atomic operation as ` CUDA.@atomic` vs. `CUDA.atomic_add!` which are included in `scatter_GPU` function?

Many thanks

``````using CUDA
using Random

x_min = 0.0
x_max = 0.067
num_nodes = 128
num_mp = Int(1e7)
densiy = 2e14

Î”h = (x_max - x_min) / (num_nodes - 1)
node_vol = Î”h * ones(Float64, num_nodes)
mpwt = (x_max - x_min) * densiy / num_mp
node_vol[1] *= 0.5
node_vol[end] *= 0.5

den = zeros(Float64, num_nodes)
pos = zeros(Float64, num_mp)

Random.seed!(1234)
for i in 1:num_mp
pos[i] = x = x_min[1] + rand() * (x_max[1] - x_min[1])
end

den_dev = CuArray(den)
node_vol_dev = CuArray(node_vol)
pos_dev = CuArray(pos)

function pos_to_logical(x, x_min, Î”h)
lc = (x - x_min) / Î”h
cellIndex = floor(Int, lc) + 1
posInCell = lc - Float64(cellIndex) + 1.0
return cellIndex, posInCell
end

function scatter_GPU(field, var, i, di)
CUDA.@atomic field[i] += var * (1.0 - di)
CUDA.@atomic field[i+1] += var * (di)
# CUDA.atomic_add!(pointer(field, i), var * (1.0 - di))
# CUDA.atomic_add!(pointer(field, i + 1), var * (di))
end

function scatter_CPU(field, var, i, di)
field[i] += var * (1.0 - di)
field[i+1] += var * (di)
end

function denCalc_kernel(den, pos, mpwt, x_min, Î”h)

stride = gridDim().x * blockDim().x
if i <= length(pos)
@inbounds begin
cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Î”h)
scatter_GPU(den, mpwt, cellIndex_x, posInCell_x)
end
end
end
return
end

function denCalc_GPU(den, pos, mpwt, node_vol, x_min, Î”h)
den .= 0.0
num_blocks = ceil(Int, length(pos) / num_threads)

CUDA.@sync begin
@cuda threads = num_threads blocks = num_blocks denCalc_kernel(den, pos, mpwt, x_min, Î”h)
end
den ./= node_vol
end

function denCalc_CPU(den, pos, mpwt, node_vol, x_min, Î”h)
den .= 0.0
@inbounds for i in eachindex(pos)
cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Î”h)
scatter_CPU(den, mpwt[i], cellIndex_x, posInCell_x)
end
den ./= node_vol
end

denCalc_GPU(den_dev, pos_dev, mpwt, node_vol_dev, x_min, Î”h)
denCalc_CPU(den, pos, mpwt, node_vol, x_min, Î”h)
``````

Very small discrepancies could be caused by the non-deterministic order of iterating through the particles when running in parallel. Floating point addition is not associative. Atomic operations do not help there, they only ensure that all the terms are included. The discrepancy should be visible also in the CPU version if you randomly reorder the particles.

1 Like