Different calculation results when using CPU vs. GPU with CUDA.jl

Hi everyone,
I am rather new to GPU programming and I am having difficulty understanding why I get different results when running a function on GPU vs. CPU. The function involves atomic operation, and even running it on GPU several times, the solution is not exactly the same. The difference in the results (between CPU and GPU or between multiple runs on GPU) is very small (within the last decimal digits). This discrepancy may seem very small, however the calculated density is the input to another function and these functions are all iterating within a loop which lead to divergence of discrepancy after a while.
So my question is if this discrepancy is because of the inherent precision of GPU calculation or I am using the atomic macro incorrectly. I defined all variables as Float64 both on GPU and CPU.

The following code includes CPU and GPU functions (denCalc_CPU & denCalc_GPU) which calculate the density on a 1D grid by depositing the weight of “macroparticles” to the grid using linear interpolation implemented in the scatter_CPU or scatter_GPU functions.

Also, I was wondering if there are any difference between using atomic operation as CUDA.@atomic vs. CUDA.atomic_add! which are included in scatter_GPU function?

Many thanks

using CUDA
using Random

x_min = 0.0
x_max = 0.067
num_nodes = 128
num_mp = Int(1e7)
densiy = 2e14

Δh = (x_max - x_min) / (num_nodes - 1)
node_vol = Δh * ones(Float64, num_nodes)
mpwt = (x_max - x_min) * densiy / num_mp
node_vol[1] *= 0.5
node_vol[end] *= 0.5

den = zeros(Float64, num_nodes)
pos = zeros(Float64, num_mp)

Random.seed!(1234)
for i in 1:num_mp
    pos[i] = x = x_min[1] + rand() * (x_max[1] - x_min[1])
end

den_dev = CuArray(den)
node_vol_dev = CuArray(node_vol)
pos_dev = CuArray(pos)

function pos_to_logical(x, x_min, Δh)
    lc = (x - x_min) / Δh
    cellIndex = floor(Int, lc) + 1
    posInCell = lc - Float64(cellIndex) + 1.0
    return cellIndex, posInCell
end

function scatter_GPU(field, var, i, di)
    CUDA.@atomic field[i] += var * (1.0 - di)
    CUDA.@atomic field[i+1] += var * (di)
    # CUDA.atomic_add!(pointer(field, i), var * (1.0 - di))
    # CUDA.atomic_add!(pointer(field, i + 1), var * (di))
end

function scatter_CPU(field, var, i, di)
    field[i] += var * (1.0 - di)
    field[i+1] += var * (di)
end

function denCalc_kernel(den, pos, mpwt, x_min, Δh)

    thread = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = thread:stride:length(pos)
        if i <= length(pos)
            @inbounds begin
                cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Δh)
                scatter_GPU(den, mpwt, cellIndex_x, posInCell_x)
            end
        end
    end
    return
end

function denCalc_GPU(den, pos, mpwt, node_vol, x_min, Δh)
    den .= 0.0
    num_threads = 512
    num_blocks = ceil(Int, length(pos) / num_threads)

    CUDA.@sync begin
        @cuda threads = num_threads blocks = num_blocks denCalc_kernel(den, pos, mpwt, x_min, Δh)
    end
    den ./= node_vol
end

function denCalc_CPU(den, pos, mpwt, node_vol, x_min, Δh)
    den .= 0.0
    @inbounds for i in eachindex(pos)
        cellIndex_x, posInCell_x = pos_to_logical(pos[i], x_min, Δh)
        scatter_CPU(den, mpwt[i], cellIndex_x, posInCell_x)
    end
    den ./= node_vol
end

denCalc_GPU(den_dev, pos_dev, mpwt, node_vol_dev, x_min, Δh)
denCalc_CPU(den, pos, mpwt, node_vol, x_min, Δh)

Very small discrepancies could be caused by the non-deterministic order of iterating through the particles when running in parallel. Floating point addition is not associative. Atomic operations do not help there, they only ensure that all the terms are included. The discrepancy should be visible also in the CPU version if you randomly reorder the particles.

1 Like

Thank you for your reply. It makes perfect sense!