Why do more BLAS threads take more time?

Hi, I wrote a deep learning model using MNIST data, and when I ran it on my PC and on a server, I ran into some performance issues. Details below.

About the devices I’m using:

  • the PC has 1 physical CPU (Intel(R) Core™ i7-9700 CPU @ 3.00GHz), 8 cores / 8 threads
  • the server has 2 physical CPUs (Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz), each with 18 cores / 36 threads, so 72 threads in total
  • Both have plenty of RAM, no other tasks were running, and when measuring the running time I skipped the first run (to exclude compilation time).

No special environment variables are set when I start the Julia REPL, so Threads.nthreads() == 1.
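For completeness: Threads.nthreads() and the BLAS thread pool are two independent settings. A minimal sketch of checking and changing the latter (the initial BLAS count will vary by machine):

```julia
using LinearAlgebra

# Julia's own task threads, set by --threads / JULIA_NUM_THREADS:
println("Julia threads: ", Threads.nthreads())

# OpenBLAS keeps its own thread pool, independent of the above;
# by default it is sized from the machine's core count.
println("BLAS threads:  ", BLAS.get_num_threads())

# The BLAS pool can be resized at runtime:
BLAS.set_num_threads(2)
println("BLAS threads now: ", BLAS.get_num_threads())
```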

My code demo is here (the loop variable i is the BLAS thread count):

using LinearAlgebra # and other packages

function train(; kws...)
    # build the model, load the data, and train
end

train(epoch=1)  # run once first, to exclude compilation time

# on the PC: test the time at each BLAS thread count
for i in reverse(2:2:8)
    BLAS.set_num_threads(i)
    @time train(epoch=1)
end

# on the server: test the time at each BLAS thread count
for i in reverse(5:5:30)
    BLAS.set_num_threads(i)
    @time train(epoch=1)
end
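To separate BLAS scaling from the rest of the training loop, it may help to time a bare matrix multiply at each thread count. A sketch, where the 1000×1000 size is an arbitrary assumption; something close to your model's layer sizes would be more telling:

```julia
using LinearAlgebra

# Time one GEMM at a given BLAS thread count; the matrix size n is an
# arbitrary assumption, pick something close to your model's layer sizes.
function gemm_seconds(nthreads; n=1000)
    BLAS.set_num_threads(nthreads)
    A = randn(n, n)
    B = randn(n, n)
    A * B                                    # warm-up call at this setting
    minimum(@elapsed(A * B) for _ in 1:3)    # best of three timed calls
end

for t in (1, 2, 4, 8)
    println(t, " threads: ", round(gemm_seconds(t); digits = 4), " s")
end
```

If this bare GEMM scales well while train() does not, the slowdown is coming from somewhere other than BLAS itself.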

I recorded and graphed the running times, and found that

  1. the more BLAS threads I use, the longer it takes to train my model;
  2. the code runs faster on Julia 1.6 than on 1.7.

The above results came from a toy model; I then tested a larger model (more parameters, same input data) and got the same result:

(sorry, the “i7-7900” in the graph should read “i7-9700”)

By the way, I can see that CPU usage tracks the BLAS thread count: if I set BLAS threads = 30, I see 30 CPU cores running at 100%. So I don’t think the slowdown has much to do with disk I/O.

Based on the above, any suggestions as to why my model takes more time as I add BLAS threads? That is what confuses me most. Also, is it common for Julia 1.6 to be faster than 1.7?

Sincerely thank you

Short answer: use MKL.jl. I’ve found very weird performance scaling with OpenBLAS (the BLAS Julia ships with by default).
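For anyone checking whether the swap actually took effect: on Julia 1.7+ the active backend is visible through libblastrampoline, so after `using MKL` the config should list libmkl_rt instead of libopenblas. A minimal check (runs with or without MKL loaded):

```julia
using LinearAlgebra

# On Julia 1.7+ BLAS calls are routed through libblastrampoline, and you
# can ask which library is actually loaded; after `using MKL` this should
# name libmkl_rt rather than libopenblas.
if VERSION >= v"1.7"
    println(BLAS.get_config())
else
    println(BLAS.vendor())   # older API, on Julia 1.6 and earlier
end
```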


I tried MKL, but I still found that more threads took more time:
8 threads → 64s
6 threads → 57s
4 threads → 56s
2 threads → 55s
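That pattern is consistent with per-call threading overhead dominating: for the small GEMMs in an MNIST-sized model, the fixed cost of fanning work out to a large thread pool can exceed the parallel speedup. A sketch that tries to reproduce the effect on a deliberately small multiply (the 200×200 size is an assumption):

```julia
using LinearAlgebra

# For small matrices, synchronizing many BLAS threads can cost more
# than the parallel work saves, so 1 thread may win.
A = randn(200, 200)
B = randn(200, 200)

for t in (1, 8)
    BLAS.set_num_threads(t)
    A * B                                       # warm up at this setting
    best = minimum(@elapsed(A * B) for _ in 1:200)
    println(t, " threads: ", round(best * 1e6; digits = 1), " μs")
end
```

Whether 1 thread actually beats 8 depends on the machine and the matrix size, so treat this as a diagnostic rather than a fixed expectation.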

For anyone interested in this, I’ve pasted my toy model code in pastecode; it takes about 8 minutes to run on my i7 Intel CPU.