Not an expert on GPUs, but I think it might be because floating-point addition is not associative in general. So maybe the CPU and GPU sum the arrays in a different order; that's all it would take to produce slightly different answers.
This is an easy experiment to convince yourself that the summation order may affect the result, without involving any GPU:
julia> for i = 1:100
           x = rand(10000)
           println(sum(x) - sum(reverse(x)))
       end
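The same experiment can be reproduced outside Julia as well; here is a rough Python equivalent (the seed is arbitrary, chosen only for reproducibility), summing each array forward and backward:

```python
import random

random.seed(0)  # arbitrary seed, just for reproducibility
diffs = []
for _ in range(100):
    x = [random.random() for _ in range(10000)]
    # left-to-right vs right-to-left accumulation rounds differently
    diffs.append(sum(x) - sum(reversed(x)))

# most trials give a tiny nonzero difference
print(max(abs(d) for d in diffs))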
As it happens, the sum function for floating-point numbers is much more intricate than you would naively think, and it's not at all surprising that implementations on different hardware end up with (slightly) different results.
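For instance, Julia's `sum` over float arrays uses pairwise (divide-and-conquer) summation rather than a plain left-to-right loop, which both changes and usually reduces the rounding error. A rough Python sketch of the two strategies (the base-case block size here is arbitrary; Julia uses a larger one):

```python
import random

def naive_sum(xs):
    # plain left-to-right accumulation
    s = 0.0
    for v in xs:
        s += v
    return s

def pairwise_sum(xs):
    # recursively split the array and add the two halves' sums
    if len(xs) <= 8:  # small base case for illustration
        return naive_sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

random.seed(1)  # arbitrary seed
x = [random.random() for _ in range(100000)]
print(naive_sum(x) - pairwise_sum(x))  # generally a tiny nonzero value
```

Both loops add up exactly the same numbers; only the grouping differs, and that is enough to change the result in the last bits.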
So this makes things trickier when comparing floating-point computations between CPU and GPU versions of the same code (especially when using CUDA kernel implementations)…
I guess the only way to prevent this is to avoid Julia's built-in functions and implement everything yourself (thus having more control over the execution order)?
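Fixing the accumulation order yourself is one option; another is compensated (Kahan) summation, which makes the result much less sensitive to summation order at the cost of a few extra operations per element. A minimal Python sketch:

```python
def kahan_sum(xs):
    # compensated summation: c carries the low-order bits lost in s + y
    s = 0.0
    c = 0.0
    for v in xs:
        y = v - c          # apply the stored correction
        t = s + y
        c = (t - s) - y    # recover what was rounded away
        s = t
    return s

# classic demo: ten 0.1s don't sum to 1.0 with naive accumulation
print(sum([0.1] * 10))        # 0.9999999999999999
print(kahan_sum([0.1] * 10))  # 1.0
```

This won't make CPU and GPU results bit-identical by itself, but it shrinks the order-dependent error dramatically.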
It is also interesting that the example provided by @GunnarFarneback shows the same variations as examples “a” and “b” above, including a new result of ±1.8189894035458565e-12 in the CPU/GPU code and ±2.7284841053187847e-12 only in @GunnarFarneback's example (CPU only).
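Those particular values are no accident: they are small multiples of the spacing between consecutive Float64 numbers near 5000 (`eps(5000.0)` in Julia). A quick check in Python:

```python
import math

ulp = math.ulp(5000.0)  # spacing of consecutive doubles near 5000: 2**-40
print(ulp)      # 9.094947017729282e-13
print(2 * ulp)  # 1.8189894035458565e-12
print(3 * ulp)  # 2.7284841053187847e-12
```

So the observed deviations are sums that landed two or three representable numbers away from 5000.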
So I decided to count the variations in my examples and in the one provided by @GunnarFarneback; here are the results, averaged over 3 runs of 10000 tests:
It's not a coincidence that you see the same deviations, but a consequence of how floating-point numbers work, or more precisely which numbers can be represented. Assume we have a sum that, with some summation order, yields exactly 5000. If we are slightly off, we will hit these numbers: