Not an expert on GPUs, but I think it might be because floating point addition is not associative in general. The CPU and GPU may simply sum the arrays in a different order; that's all it would take to produce slightly different answers.

This is an easy experiment to convince yourself that the summation order can affect the result, without involving any GPU:

julia> for i = 1:100
           x = rand(10000)
           println(sum(x) - sum(reverse(x)))
       end

As it happens, the sum function for floating point numbers is much more intricate than you might naively think, and it's not at all surprising that implementations on different hardware end up with (slightly) different results.
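For instance, Julia's `sum` on arrays uses pairwise (blocked) summation rather than a simple left-to-right loop, so even a straightforward reimplementation can disagree with it in the last bits (a sketch, not Base's actual code):

```julia
# Naive left-to-right accumulation, compared with Base.sum, which uses
# pairwise summation for arrays. The two orderings can differ in the
# last few bits, even though both are "correct" to within rounding.
naive_sum(x) = foldl(+, x)

x = rand(10_000)
println(naive_sum(x) - sum(x))  # typically tiny, often nonzero
```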

So this makes things trickier when comparing floating point computations between CPU and GPU versions of the same code (especially when using CUDA kernel implementations)…

I guess the only way to prevent this is to avoid Julia's built-in functions and implement everything yourself (thus having more control over the execution order)?

It is also interesting that the example provided by @GunnarFarneback shows the same variations as the examples “a” and “b” above, including a new result of ±1.8189894035458565e-12 in the CPU/GPU code and ±2.7284841053187847e-12 only in the @GunnarFarneback example (CPU only).

So I decided to count the variations of my examples and the one provided by @GunnarFarneback; here are the results, averaged over 3 runs of 10000 tests:

If the variations were due to differences in summation order, shouldn't the values of the variations be more arbitrary, instead of showing what appears to be a pattern?

It's not a coincidence that you see the same deviations, but a consequence of how floating point numbers work, or more exactly which numbers can be represented. Assume we have a sum that with some summation order yields exactly 5000. If we are slightly off, we will hit these numbers:
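Concretely (a small sketch to illustrate the point): Float64 values near 5000 are spaced `eps(5000.0) = 2^-40` apart, so any deviation from an exact 5000 must be an integer multiple of that spacing — which is exactly the pattern in the counts above:

```julia
# Representable Float64 values near 5000 are eps(5000.0) apart, so
# deviations can only be integer multiples of that spacing.
spacing = eps(5000.0)
println(spacing)        # 9.094947017729282e-13 (== 2.0^-40)
println(2 * spacing)    # 1.8189894035458565e-12
println(3 * spacing)    # 2.7284841053187847e-12
println(nextfloat(5000.0) - 5000.0 == spacing)  # true
```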

Well, yes, but you still need to make tradeoffs between what is fast on different hardware and what is numerically good enough (depending on your needs, one or more of those might be non-issues).

A better plan than to require exact equality is usually to accept small numeric deviations but design your algorithms so the deviations don’t grow out of control.
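In Julia, one idiomatic way to do such a comparison (an illustration; pick a tolerance suited to your problem) is `isapprox`, also written as `≈`:

```julia
a = 5000.0
b = a + 2e-12                        # simulate a last-bits rounding difference
println(a == b)                      # exact equality is too strict here
println(isapprox(a, b; rtol=1e-10))  # comparison with a relative tolerance
```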