Matrix multiplication with CPU and CUDA

Matrix multiplication yields slightly different results on the CPU and on CUDA. Is this normal behaviour?

julia> using CUDA

julia> X = randn(Float32, 8, 16)
8×16 Array{Float32,2}:
  0.933839    0.535854   0.054124  …  -0.397162   0.795055  -0.797737
  0.0457853  -0.130413  -1.96217      -0.883076  -0.382235   0.349118
  1.09144     1.56768   -0.924807     -0.386719   1.31583    1.12212
  0.936833    0.598771   0.278092      0.184339   0.598756  -0.130373
 -0.379447   -0.326397   0.374349      0.258037  -1.08479    0.0759822
 -1.32479    -1.19413    0.124332  …   0.836484  -1.80901    0.28113
 -0.665023   -1.7046    -0.609449     -0.298331   2.9212     0.554513
  0.29803    -1.30962   -0.980602      0.469455  -0.740527  -0.967735

julia> Xn = X ./ sqrt.(sum(abs2, X; dims=1))
8×16 Array{Float32,2}:
  0.402862   0.175964   0.0216077  …  -0.267963   0.19633    -0.434993
  0.019752  -0.042825  -0.783351      -0.595807  -0.0943884   0.190368
  0.470852   0.514796  -0.369207      -0.260918   0.324928    0.611874
  0.404154   0.196625   0.111021       0.124373   0.147856   -0.0710901
 -0.163695  -0.107182   0.14945        0.174096  -0.267877    0.0414318
 -0.57152   -0.39213    0.0496365  …   0.564372  -0.446714    0.153296
 -0.286894  -0.559757  -0.243308      -0.201282   0.721356    0.302367
  0.128571  -0.430052  -0.391482       0.316739  -0.182864   -0.52769

julia> Xngpu = CuArray(Xn)
8×16 CuArray{Float32,2}:
  0.402862   0.175964   0.0216077  …  -0.267963   0.19633    -0.434993
  0.019752  -0.042825  -0.783351      -0.595807  -0.0943884   0.190368
  0.470852   0.514796  -0.369207      -0.260918   0.324928    0.611874
  0.404154   0.196625   0.111021       0.124373   0.147856   -0.0710901
 -0.163695  -0.107182   0.14945        0.174096  -0.267877    0.0414318
 -0.57152   -0.39213    0.0496365  …   0.564372  -0.446714    0.153296
 -0.286894  -0.559757  -0.243308      -0.201282   0.721356    0.302367
  0.128571  -0.430052  -0.391482       0.316739  -0.182864   -0.52769

julia> Xn'Xn
16×16 Array{Float32,2}:
  1.0         0.738857   -0.169102   …  -0.444887   0.358672   -0.161098
  0.738857    1.0         0.138181      -0.395013   0.113672    0.209446
 -0.169102    0.138181    1.0            0.550084  -0.1915     -0.245511
 -0.282271    0.0657576  -0.530405      -0.304412  -0.216795    0.710804
 -0.256974   -0.0361658   0.0866007     -0.194733  -0.218947   -0.0136854
 -0.201131   -0.43224     0.136756   …   0.558831  -0.287434   -0.610079
  0.1918     -0.0818162  -0.593462      -0.666823   0.617977    0.376696
 -0.0702096   0.261633   -0.013336      -0.299576  -0.0797219   0.234997
  0.155916    0.133923    0.613856       0.238527  -0.228438   -0.694853
  0.0748265   0.186705   -0.670178      -0.576346   0.117481    0.563981
 -0.369942   -0.404038    0.590167   …   0.522179   0.135644   -0.249161
  0.285386    0.463637   -0.246384      -0.584528   0.133876    0.339589
  0.201663    0.228354   -0.0933126     -0.332328   0.0289431   0.0490236
 -0.444887   -0.395013    0.550084       1.0       -0.564628   -0.299623
  0.358672    0.113672   -0.1915        -0.564628   1.0         0.319965
 -0.161098    0.209446   -0.245511   …  -0.299623   0.319965    1.0

julia> Xngpu'Xngpu
16×16 CuArray{Float32,2}:
  0.999695    0.738645   -0.1691     …  -0.444913   0.358664   -0.161021
  0.738645    0.999493    0.138218      -0.394975   0.113817    0.209457
 -0.1691      0.138218    0.999792       0.549904  -0.191429   -0.245223
 -0.282113    0.0656881  -0.530367      -0.304383  -0.216708    0.710804
 -0.257117   -0.0362208   0.086632      -0.194721  -0.218912   -0.0135966
 -0.201098   -0.432148    0.136684   …   0.558802  -0.287445   -0.610156
  0.191717   -0.0817629  -0.593494      -0.666916   0.617858    0.376594
 -0.0701617   0.261592   -0.0133105     -0.299631  -0.0795134   0.235098
  0.155724    0.133949    0.613852       0.238599  -0.228488   -0.69481
  0.0749329   0.186722   -0.669925      -0.576274   0.117621    0.563793
 -0.369988   -0.403988    0.590042   …   0.522104   0.135522   -0.249156
  0.285362    0.463679   -0.24624       -0.584576   0.134029    0.339626
  0.201541    0.228333   -0.0931169     -0.332243   0.0289078   0.0490151
 -0.444913   -0.394975    0.549904       1.0       -0.564707   -0.299485
  0.358664    0.113817   -0.191429      -0.564707   0.999773    0.319842
 -0.161021    0.209457   -0.245223   …  -0.299485   0.319842    1.00005
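
The differences are small, but to quantify them one could do something like the following (a sketch reusing the arrays above; exact numbers will depend on hardware and library versions):

cpu_result = Xn'Xn
gpu_result = Array(Xngpu'Xngpu)           # copy the GPU result back to host memory

maximum(abs, cpu_result .- gpu_result)    # largest element-wise CPU/GPU difference

# Comparing both against a Float64 reference gives a sense of how far each
# single-precision result is from the "exact" product:
ref = Float64.(Xn)' * Float64.(Xn)
maximum(abs, cpu_result .- ref)
maximum(abs, gpu_result .- ref)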

Julia and package versions:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)

(jl_2M2wMs) pkg> st
Status `/tmp/jl_2M2wMs/Project.toml`
  [052768ef] CUDA v2.4.1

Yes, that’s normal. Setting the CUDA.jl math mode to PEDANTIC_MATH may give you more accurate results, at the expense of performance, if you care about that.
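
For example (a sketch; the exact API may differ across CUDA.jl versions, see ?CUDA.math_mode! for yours):

using CUDA

# Ask CUBLAS not to use reduced-precision / fast-math code paths:
CUDA.math_mode!(CUDA.PEDANTIC_MATH)

Xngpu'Xngpu   # should now track the CPU result more closely

# Restore the default behaviour afterwards:
CUDA.math_mode!(CUDA.DEFAULT_MATH)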

Thanks!