We see then that floating point operations are managed slightly differently on the CPU and GPU. Is it something to do with Julia (CuArrays?) or does it boil down to the hardware?
This is likely just that floating point math isn’t associative, so re-ordering the computations can produce different results. Specifically, GPUs batch operations, which can change the results. CPUs do this too, because of SIMD instructions, but they batch differently, so the results can end up different.
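For example, simply regrouping the same three Float64 additions already changes the last bit of the result:

julia> (0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)
(0.6000000000000001, 0.6)

A matrix-vector product is just many such sums, accumulated in whatever order the hardware finds convenient, so the CPU and GPU results need not match bit for bit.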
The CPU and GPU may also have floating point registers of different sizes. Desktop Intel hardware has 80-bit wide registers with guard digits. Server-class chips like Xeons and Opterons have 128-bit or larger registers and can fit more than one 64-bit number in there, but have no guard digits. GPUs also have no guard digits. You should see similar effects if you compare your Intel laptop (has guard digits) to a server with Xeons.
Your typical Intel laptop has 256-bit registers that fit 8 Float32. Some recent laptops (Ice Lake) fit 16 Float32 per register.
80-bit registers are almost never used.
Oh, but they are! Guard digits are used by, for example, any BLAS call. Float64 and Float32 use them routinely. You would not see it in user code, but when you accumulate sums in a register it happens automatically.
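As a rough illustration of the same idea (accumulating in extra precision; this is not what BLAS actually does internally), compare a plain Float32 left-to-right sum with the same sum accumulated in Float64 and rounded back:

julia> let
           x = rand(Float32, 10_000)
           naive = foldl(+, x)                     # left-to-right Float32 accumulation
           wide  = Float32(foldl(+, Float64.(x)))  # same order, accumulated in Float64
           naive - wide                            # typically a small, data-dependent nonzero difference
       end

The two results typically differ slightly, with the wide-accumulator version closer to the true sum.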
One of the easiest ways to figure out if something is going wrong is to go to higher precision.
julia> using CUDA
julia> let
           x = rand(Float32, 10_000)
           W = rand(Float32, 10_000, 10_000)
           y = W * x
           cux = CuArray(x)
           cuW = CuArray(W)
           cuy = cuW * cux
           Array(cuy)[1:10] .- y[1:10]
       end
10-element Array{Float32,1}:
-0.00048828125
0.0012207031
0.0007324219
-0.0012207031
0.0021972656
-0.00024414062
-0.0021972656
-0.00024414062
0.0
0.00024414062
whereas at 64-bit precision, we see
julia> let
           x = rand(Float64, 10_000)
           W = rand(Float64, 10_000, 10_000)
           y = W * x
           cux = CuArray(x)
           cuW = CuArray(W)
           cuy = cuW * cux
           Array(cuy)[1:10] .- y[1:10]
       end
10-element Array{Float64,1}:
1.8189894035458565e-12
-3.183231456205249e-12
7.275957614183426e-12
-1.8189894035458565e-12
-4.092726157978177e-12
-7.73070496506989e-12
4.547473508864641e-13
-2.7284841053187847e-12
1.3642420526593924e-12
9.094947017729282e-13
This strongly indicates to me that the problem is just the lack of precision in Float32. Also, what you usually really want to know is the relative error:
julia> let
           x = rand(Float32, 10_000)
           W = rand(Float32, 10_000, 10_000)
           y = W * x
           cux = CuArray(x)
           cuW = CuArray(W)
           cuy = cuW * cux
           (Array(cuy)[1:10] .- y[1:10]) ./ y[1:10]
       end
10-element Array{Float32,1}:
-1.9602805f-7
4.906113f-7
-3.875746f-7
-1.9639621f-7
-1.0653941f-6
2.9447702f-7
-2.9167163f-7
4.891585f-7
-9.862888f-8
-1.9692011f-7
So while the Float32 calculation showed what looked like large differences between the GPU and CPU results, relative to the magnitude of those results the differences are actually quite small.
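For reference, eps(Float32) is about 1.2e-7 and eps(Float64) about 2.2e-16, so relative differences of the size seen above are only a few units of machine precision in each case, which is about what you would expect after accumulating 10,000 rounded products in a different order:

julia> eps(Float32), eps(Float64)
(1.1920929f-7, 2.220446049250313e-16)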