Different results when running on CPU or GPU

If I do

using Flux

x = rand(Float32, 10000);
W = rand(Float32, 10000, 10000);
y = W*x;        # matrix-vector product on the CPU

gx = x |> gpu;  # move the data to the GPU
gW = W |> gpu;
gy = gW*gx;     # the same product on the GPU

(gy[1:10] |> cpu) .- y[1:10]

I get something like

10-element Array{Float32,1}:
 -0.0014648438
  0.0014648438
 -0.0012207031
  0.0
  0.00024414062
 -0.0007324219
 -0.0017089844
  0.0007324219
  0.00024414062
  0.0026855469

It seems, then, that floating point operations are handled slightly differently on the CPU and GPU. Is it something due to Julia (CuArrays?) or does it boil down to the hardware?

This is likely just because floating point math isn’t associative, so re-ordering the computations can produce different results. Specifically, GPUs batch operations, which changes the order in which values are accumulated. CPUs also batch, because of SIMD instructions, but they do so differently, so the two results can end up slightly different.
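
The non-associativity is easy to see at the REPL; as a minimal illustration, whether a small term survives a sum depends entirely on the grouping:

julia> (1.0f0 + 1.0f8) - 1.0f8  # the 1 is absorbed by the large value before the subtraction
0.0f0

julia> 1.0f0 + (1.0f8 - 1.0f8)  # same numbers, different grouping
1.0f0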

The CPU and GPU may also have floating point registers of different sizes. Desktop Intel hardware has 80-bit wide registers with guard digits. Server-class chips like Xeons and Opterons have 128-bit or larger registers and can fit more than one 64-bit number in there, but have no guard digits. GPUs also have no guard digits. You should see similar effects if you compare your Intel laptop (has guard digits) to a server with Xeons.

3 Likes

Your typical Intel laptop has 256-bit registers that fit 8 Float32. Some recent laptops (Ice Lake) fit 16 Float32 per register.
80-bit registers are almost never used.
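
You can see the effect of SIMD re-association at the language level with a minimal sketch (the sum_simd helper here is just for illustration): marking a reduction loop with @simd permits the compiler to reorder the accumulation, so it typically no longer matches a strict left-to-right sum. The exact discrepancy depends on your hardware and Julia version.

function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

x = rand(Float32, 10_000);
sum_simd(x) - foldl(+, x)  # usually a small nonzero value: @simd reassociates the sum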

Oh, but they are! Guard digits are used by, for example, any BLAS call; Float64 and Float32 use them routinely. You would not see it in user code, but when you accumulate sums in a register it happens automatically.
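
A quick way to check this on your own machine (whatever the exact accumulation mechanism your BLAS uses) is to compare a BLAS-backed product against a plain left-to-right loop; the two typically disagree in the low bits:

using LinearAlgebra  # BLAS-backed matrix-vector product

W = rand(Float32, 1, 10_000);
x = rand(Float32, 10_000);
(W * x)[1] - foldl(+, W[1, :] .* x)  # typically nonzero: BLAS accumulates differently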

Just looked at the Core i5 hardware specs. @Elrod is right. I’m an old guy and still have 8087s wired into my head.

One of the easiest ways to figure out if something is going wrong is to go to higher precision.

julia> using CUDA

julia> let
       x = rand(Float32, 10_000)
       W = rand(Float32, 10_000, 10_000)
       y = W*x

       cux = CuArray(x)
       cuW = CuArray(W)
       cuy = cuW * cux

       Array(cuy)[1:10] .- y[1:10]
       end
10-element Array{Float32,1}:
 -0.00048828125
  0.0012207031
  0.0007324219
 -0.0012207031
  0.0021972656
 -0.00024414062
 -0.0021972656
 -0.00024414062
  0.0
  0.00024414062

whereas at 64-bit precision, we see

julia> let
       x = rand(Float64, 10_000)
       W = rand(Float64, 10_000, 10_000)
       y = W*x

       cux = CuArray(x)
       cuW = CuArray(W)
       cuy = cuW * cux

       Array(cuy)[1:10] .- y[1:10]
       end
10-element Array{Float64,1}:
  1.8189894035458565e-12
 -3.183231456205249e-12
  7.275957614183426e-12
 -1.8189894035458565e-12
 -4.092726157978177e-12
 -7.73070496506989e-12
  4.547473508864641e-13
 -2.7284841053187847e-12
  1.3642420526593924e-12
  9.094947017729282e-13

This strongly suggests to me that the problem is just the limited precision of Float32. Also, what you usually really want to know is the relative error:

julia> let
       x = rand(Float32, 10_000)
       W = rand(Float32, 10_000, 10_000)
       y = W*x

       cux = CuArray(x)
       cuW = CuArray(W)
       cuy = cuW * cux

       (Array(cuy)[1:10] .- y[1:10]) ./ y[1:10]
       end
10-element Array{Float32,1}:
 -1.9602805f-7
  4.906113f-7
 -3.875746f-7
 -1.9639621f-7
 -1.0653941f-6
  2.9447702f-7
 -2.9167163f-7
  4.891585f-7
 -9.862888f-8
 -1.9692011f-7

So while the Float32 calculation showed what looked like large differences between the GPU and CPU results, relative to the magnitude of those results the differences are actually quite small.
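
For scale, the machine epsilon of Float32 is

julia> eps(Float32)
1.1920929f-7

so the relative differences above amount to only a handful of ulps.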
