Why do more BLAS threads take more time?

Hi, I wrote a deep learning model using MNIST data, and when I ran it on my PC and on a server, I ran into some performance issues. Details below.

About the devices I’m using:

  • the PC has 1 physical CPU (Intel(R) Core™ i7-9700 CPU @ 3.00GHz), 8 cores / 8 threads
  • the server has 2 physical CPUs (Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz), each with 18 cores / 36 threads, so 72 threads in total
  • Both have plenty of RAM, no other tasks were running, and when measuring the running time I skipped the first run (to exclude compilation time).

No special environment variables are set when I start the Julia REPL, so Threads.nthreads() == 1.
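For completeness: Threads.nthreads() and the BLAS thread pool are two independent settings. A minimal sketch of checking and changing the latter (the initial BLAS count will vary by machine):

```julia
using LinearAlgebra

# Julia's own task threads, set by --threads / JULIA_NUM_THREADS:
println("Julia threads: ", Threads.nthreads())

# OpenBLAS keeps its own thread pool, independent of the above;
# by default it is sized from the machine's core count.
println("BLAS threads:  ", BLAS.get_num_threads())

# The BLAS pool can be resized at runtime:
BLAS.set_num_threads(2)
println("BLAS threads now: ", BLAS.get_num_threads())
```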

My code demo is here (the loop variable i is the BLAS thread count):

using LinearAlgebra # and other packages

function train(; kws...)
    # build the model, load the data, and train
end

train(epoch=1)  # run once first, to exclude compilation time

# on the PC: test the time at each BLAS thread count
for i in reverse(2:2:8)
    BLAS.set_num_threads(i)
    @time train(epoch=1)
end

# on the server: test the time at each BLAS thread count
for i in reverse(5:5:30)
    BLAS.set_num_threads(i)
    @time train(epoch=1)
end
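To separate BLAS scaling from the rest of the training loop, it may help to time a bare matrix multiply at each thread count. A sketch, where the 1000×1000 size is an arbitrary assumption; something close to your model's layer sizes would be more telling:

```julia
using LinearAlgebra

# Time one GEMM at a given BLAS thread count; the matrix size n is an
# arbitrary assumption, pick something close to your model's layer sizes.
function gemm_seconds(nthreads; n=1000)
    BLAS.set_num_threads(nthreads)
    A = randn(n, n)
    B = randn(n, n)
    A * B                                    # warm-up call at this setting
    minimum(@elapsed(A * B) for _ in 1:3)    # best of three timed calls
end

for t in (1, 2, 4, 8)
    println(t, " threads: ", round(gemm_seconds(t); digits = 4), " s")
end
```

If this bare GEMM scales well while train() does not, the slowdown is coming from somewhere other than BLAS itself.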

I recorded and graphed the running times, and found that

  1. the more BLAS threads I use, the longer it takes to train my model;
  2. the code runs faster on Julia 1.6 than on 1.7.

The above results came from a toy model; I then tested a larger model (more parameters, same input data) and got the same result:

(sorry, the “i7-7900” in the graph should read “i7-9700”)

By the way, I can see that CPU usage tracks the BLAS thread count: if I set BLAS threads = 30, I see 30 CPU cores running at 100%. So I don’t think the slowdown has much to do with disk I/O.

Based on the above, any suggestions as to why my model takes more time as I add BLAS threads? That is what confuses me most. Also, is it common for Julia 1.6 to be faster than 1.7?

Sincerely thank you

Short answer: use MKL.jl. I’ve found very weird performance scaling with OpenBLAS (the BLAS Julia ships with by default).
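For anyone checking whether the swap actually took effect: on Julia 1.7+ the active backend is visible through libblastrampoline, so after `using MKL` the config should list libmkl_rt instead of libopenblas. A minimal check (runs with or without MKL loaded):

```julia
using LinearAlgebra

# On Julia 1.7+ BLAS calls are routed through libblastrampoline, and you
# can ask which library is actually loaded; after `using MKL` this should
# name libmkl_rt rather than libopenblas.
if VERSION >= v"1.7"
    println(BLAS.get_config())
else
    println(BLAS.vendor())   # older API, on Julia 1.6 and earlier
end
```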


I tried MKL, but I still found that more threads took more time:
8 threads → 64s
6 threads → 57s
4 threads → 56s
2 threads → 55s
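That pattern is consistent with per-call threading overhead dominating: for the small GEMMs in an MNIST-sized model, the fixed cost of fanning work out to a large thread pool can exceed the parallel speedup. A sketch that tries to reproduce the effect on a deliberately small multiply (the 200×200 size is an assumption):

```julia
using LinearAlgebra

# For small matrices, synchronizing many BLAS threads can cost more
# than the parallel work saves, so 1 thread may win.
A = randn(200, 200)
B = randn(200, 200)

for t in (1, 8)
    BLAS.set_num_threads(t)
    A * B                                       # warm up at this setting
    best = minimum(@elapsed(A * B) for _ in 1:200)
    println(t, " threads: ", round(best * 1e6; digits = 1), " μs")
end
```

Whether 1 thread actually beats 8 depends on the machine and the matrix size, so treat this as a diagnostic rather than a fixed expectation.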

For anyone interested in this, I’ve pasted my toy model code in pastecode; it takes about 8 minutes to run on my i7 Intel CPU.