This has already been mentioned by me and @jpsamaroo in the other discourse thread you linked. Basically, OPENBLAS_NUM_THREADS=1 (you have a typo there) is special because it makes OpenBLAS computations run on the respective calling (Julia) thread. For OPENBLAS_NUM_THREADS>1 the behavior changes qualitatively: OpenBLAS creates its own pool of OpenBLAS threads, which it uses to run BLAS computations triggered by any of the Julia threads (there is only a single pool of OpenBLAS threads, irrespective of how many Julia threads you have). Hence, assuming Threads.nthreads() == 16, setting OPENBLAS_NUM_THREADS=1 will effectively make your BLAS computations run across all of the Julia threads (16), whereas setting OPENBLAS_NUM_THREADS=2 will make all your BLAS computations run on only 2 separate OpenBLAS threads. That’s why you see such horrible performance for your 16/8 case, for example.
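In case it helps, here is a quick way to inspect and change both settings from within a Julia session (assuming you started Julia with `-t 16` or `JULIA_NUM_THREADS=16`; setting the BLAS thread count at runtime has the same effect as OPENBLAS_NUM_THREADS at startup):

```julia
using LinearAlgebra

# Julia threads are fixed at startup, e.g. via `julia -t 16`
# or the JULIA_NUM_THREADS environment variable.
@show Threads.nthreads()

# OpenBLAS threads can be queried and changed at runtime.
@show BLAS.get_num_threads()
BLAS.set_num_threads(1)   # BLAS calls now run on the calling Julia thread
```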
As for your other question: in general, multithreading your computation with Julia threads (if possible) and using OPENBLAS_NUM_THREADS=1 should be better than using only a single Julia thread and OPENBLAS_NUM_THREADS=16. The main point is that you can parallelize your specific application much more effectively than OpenBLAS can, since OpenBLAS only parallelizes the BLAS parts. However, as with every “rule of thumb”, there are exceptions, and it can depend on the computation at hand. (BTW, in your case, the rule of thumb seems to hold: compare 16/1 (538) to 1/16 (900).)