OpenBLAS on master -- getting wrong results with multithreading?


#1

I am getting inconsistent results from multiplying the same two matrices together with OpenBLAS on Julia master:

julia> y = randn(400,400);

julia> x = randn(400,400);

julia> cond(x * y)
3108.7389106095816

julia> cond(x * y)
250095.81630383938

julia> cond(x * y) # same
250095.81630383938

julia> cond(x * y)
16627.89658977373

julia> cond(x * y) # again
250095.81630383938

julia> cond(x * y)
2631.777572954922

julia> z1 = x * y;

julia> z2 = x * y;

julia> maximum(abs, z1  .- z2)
81.02332984524476

81 is not a trivial difference!!
I understand that if threading results in adding numbers together in a different order (depending on when threads complete), that it is expected to get slightly different answers. But 81 is a very large number compared to the magnitude of entries in the matrix. argmax points to the 81, and I see a large block of incorrect answers.
When I look at , the matrix of differences, I notice that it is almost entirely 0.0.

julia> zδ[1:15,350:365]
 0.0   3.40016     5.51059   15.5744   38.3682    7.10866  38.1058    0.791143   27.7809   15.6203   13.285      1.89272   17.4851   0.0  0.0  0.0
 0.0  21.2102      0.686299   7.07578  38.1467   21.3853   13.6936   22.3834     20.9474   12.5285   19.6346     7.20672   18.6198   0.0  0.0  0.0
 0.0   5.87573     4.10078   31.6043    2.47654  19.1434   19.9269   40.6902      8.60271  17.1051   38.5678    13.5105    39.141    0.0  0.0  0.0
 0.0  29.5379     17.8031    19.7143    5.82252  12.9395   81.0233    8.24332     9.7846    6.86419  34.6936    40.1017     3.81325  0.0  0.0  0.0
 0.0   7.74442    30.1483     1.19507  19.9153   24.768     3.15936  26.7861      8.8477   25.36      4.08705   34.1401    25.1399   0.0  0.0  0.0
 0.0  25.1696      6.23755    5.07073  22.7564   22.1431    9.66137  18.8559     18.284    34.299     8.87634   11.4519    19.5225   0.0  0.0  0.0
 0.0   0.0837383  10.7374     1.09023  31.3011   12.3954   17.2506   34.3372     33.3534    5.56733   3.04148    0.278859   6.39959  0.0  0.0  0.0
 0.0  65.448      16.9498     8.4026   19.6995   11.1045   52.3888   23.052       2.5545   27.4019    2.33811    1.74398    5.8446   0.0  0.0  0.0
 0.0  31.9391      0.708929   4.41991  12.3908   10.8966    8.13072  11.3948      8.15043   6.74772  19.3858    41.8931    37.9043   0.0  0.0  0.0
 0.0   3.92232     2.29414   20.2639   30.3701   17.8275    8.54815   0.0279554   1.83999  35.9216    3.36693   33.7164    11.1285   0.0  0.0  0.0
 0.0   3.98999    17.6255    33.2399    1.99359  26.5441   24.0701   31.2736     11.5732   15.1464   23.3285    24.5417     0.35802  0.0  0.0  0.0
 0.0  22.4708      8.45022   11.09     22.609    25.0343   21.2055    9.91141    11.274     5.06574  17.0936     7.48366   36.9995   0.0  0.0  0.0
 0.0   0.112206   25.5591    20.8131    1.24503  46.9793   18.4077   24.4956      5.83157  51.9475    0.893717   5.15207   12.8931   0.0  0.0  0.0
 0.0  14.986       9.28381   16.6061   35.9068    1.0082    5.79184   9.15659     2.57992  23.2148   33.4547    11.6657    13.8442   0.0  0.0  0.0
 0.0  16.46       28.1464    35.3354    4.86666  26.4538    4.80481  36.36       37.221    27.8764   25.0792     4.25533   25.3219   0.0  0.0  0.0

Setting it to a single thread solves the problem.

julia> BLAS.set_num_threads(1)

julia> zot = x*y;

julia> maximum(abs, z1 .- zot)
84.05177973473988

julia> maximum(abs, z2 .- zot)
83.6115213251184

julia> maximum(abs, z3 .- zot)
83.6115213251184

From a few tests vs my own matmul functions, the single threaded BLAS seems to be getting the correct answer (maximum absolute difference closer to 10^-13).

MKL, and Julia 0.6 + OpenBLAS get the correct answer.
Should I file an issue with OpenBLAS on github?

I reproduced on two computers (one AMD Ryzen, the other Intel Skylake), so this seems like a cross platform problem. Please let me know if you can/cannot reproduce.


#2

What’s your exact version? Just tried with

Julia Version 0.7.0-beta2.143
Commit cb6f5e225d (2018-07-28 00:48 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel® Xeon® W-2133 CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

and get consistent results regardless of the number of threads:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> x=randn(400,400); y=randn(400,400); z=x*y;

julia> BLAS.set_num_threads(6)

julia> extrema([maximum(abs, z .- (x*y) .+i .-i) for i=1:1000])
(0.0, 5.684341886080802e-14)
# +i -i to prevent hoisting the result out of the loop

julia> BLAS.set_num_threads(12)

julia> extrema([maximum(abs, z .- (x*y) .+i .-i) for i=1:1000])
(5.684341886080802e-14, 1.1368683772161603e-13)

If you built from source, try doing make cleanall and make -C deps cleanall. I’ve noticed occassionally that the incremental build doesn’t always (re)compile everything needed.


#3

Okay. interesting. Would be nice if it did always recompile everything needed, because simply running git pull and rerunning make -j rarely takes notably longer than just rebuilding the system image (which I’d likely do anyway) while being a heck of a lot more convenient.

Cleaning and rebuilding can be an ordeal.
(Should have taken the opportunity to get gcc-7 for ARPACK. I’ll just wait until I can upgrade Ubuntu 16.04 -> Ubuntu 18.04.)
Cleaned everything. Error related to gmp header incompatibility.
I’m leaving for a week in <1 hour, so I don’t have time to to figure that out.
So starting over with a clean pull…

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> x=randn(400,400); y=randn(400,400); z=x*y;

julia> BLAS.set_num_threads(6)

julia> extrema([maximum(abs, z .- (x*y) .+i .-i) for i=1:1000])
(0.0, 5.684341886080802e-14)

julia> BLAS.set_num_threads(12)

julia> extrema([maximum(abs, z .- (x*y) .+i .-i) for i=1:1000])
(5.684341886080802e-14, 1.1368683772161603e-13)

julia> BLAS.set_num_threads(16)

julia> extrema([maximum(abs, z .- (x*y) .+i .-i) for i=1:1000])
(0.0, 5.684341886080802e-14)

julia> versioninfo()
Julia Version 0.7.0-beta2.164
Commit 2fb2028 (2018-07-28 12:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)

That solved it!

I’ll have to clear out my installs on all the other computers too then.
Any rough idea of how often I should start fresh instead of simply rerunning make?

Given the gmp error I got that didn’t show up when pulling, it also seems that something got left behind with cleanall, too.