I was searching for multithreading in Julia, and finally reached this interesting post.
What I found is that OpenBLAS manages its own thread pool and uses it unless we call BLAS.set_num_threads(1). So when we start with JULIA_NUM_THREADS=4 and call BLAS.set_num_threads(4), it's not 4x4 = 16 but 4 + 4 = 8.
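To see the two thread counts side by side, a minimal sketch along these lines may help. It assumes a Julia version where BLAS.get_num_threads is available (1.6 or later, as far as I know) and only prints the two settings; it does not count OS threads, which is what top shows.

```julia
using LinearAlgebra

# Julia's own thread count is fixed at startup
# (JULIA_NUM_THREADS or the `-t` command-line flag).
julia_threads = Threads.nthreads()

# The BLAS thread count is a separate setting that can be changed at runtime.
BLAS.set_num_threads(4)
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
```

The two numbers are set independently, which is why the totals add up rather than multiply.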
When I run the manymul_threaded! benchmark:
top - 00:15:38 up 65 days, 7:55, 6 users, load average: 5.44, 2.78, 1.43
Threads: 8 total, 7 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 75.3 us, 24.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8004416 total, 3736952 free, 3768084 used, 499380 buff/cache
KiB Swap: 32767868 total, 31780740 free, 987128 used. 3825440 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24866 alkorang 20 0 4002112 2.950g 15396 R 64.5 38.6 4:52.49 julia
24871 alkorang 20 0 4002112 2.950g 15396 R 60.5 38.6 3:44.93 julia
24872 alkorang 20 0 4002112 2.950g 15396 R 59.0 38.6 3:44.58 julia
24868 alkorang 20 0 4002112 2.950g 15396 R 57.4 38.6 0:50.98 julia
24873 alkorang 20 0 4002112 2.950g 15396 R 55.1 38.6 3:44.77 julia
24870 alkorang 20 0 4002112 2.950g 15396 R 51.6 38.6 0:49.21 julia
24869 alkorang 20 0 4002112 2.950g 15396 R 49.2 38.6 0:51.68 julia
24867 alkorang 20 0 4002112 2.950g 15396 S 0.0 38.6 0:00.00 julia
When I run randn(5000, 5000) * randn(5000, 5000):
top - 00:16:41 up 65 days, 7:56, 6 users, load average: 4.49, 3.12, 1.65
Threads: 8 total, 4 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.6 us, 0.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 8004416 total, 3537528 free, 3963780 used, 503108 buff/cache
KiB Swap: 32767868 total, 31780752 free, 987116 used. 3627800 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24871 alkorang 20 0 4197428 3.137g 16064 R 99.7 41.1 4:17.85 julia
24872 alkorang 20 0 4197428 3.137g 16064 R 99.7 41.1 4:16.56 julia
24866 alkorang 20 0 4197428 3.137g 16064 R 99.3 41.1 5:23.14 julia
24873 alkorang 20 0 4197428 3.137g 16064 R 98.7 41.1 4:17.31 julia
24867 alkorang 20 0 4197428 3.137g 16064 S 0.0 41.1 0:00.00 julia
24868 alkorang 20 0 4197428 3.137g 16064 S 0.0 41.1 1:18.36 julia
24869 alkorang 20 0 4197428 3.137g 16064 S 0.0 41.1 1:06.98 julia
24870 alkorang 20 0 4197428 3.137g 16064 S 0.0 41.1 1:06.37 julia
(This comes from the top command on Linux: top -H -p <pid>)
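This tells us that calling a multithreaded function does not necessarily create new threads; it depends on the implementation. Here the same 8 threads exist in both runs, and only the number of running threads changes.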
Similarly, calling BLAS.set_num_threads(2) does not destroy 2 threads from the BLAS thread pool. It just leaves 2 of them out of the computation.
BLAS.set_num_threads(2); randn(5000, 5000) * randn(5000, 5000):
top - 00:43:46 up 65 days, 8:23, 6 users, load average: 1.03, 0.32, 0.54
Threads: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 50.3 us, 0.0 sy, 0.0 ni, 49.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 8004416 total, 3718088 free, 3781652 used, 504676 buff/cache
KiB Swap: 32767868 total, 31789756 free, 978112 used. 3809092 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24866 alkorang 20 0 4002112 2.954g 16112 R 99.9 38.7 8:46.08 julia
24871 alkorang 20 0 4002112 2.954g 16112 R 99.9 38.7 7:39.11 julia
24867 alkorang 20 0 4002112 2.954g 16112 S 0.0 38.7 0:00.00 julia
24868 alkorang 20 0 4002112 2.954g 16112 S 0.0 38.7 1:18.36 julia
24869 alkorang 20 0 4002112 2.954g 16112 S 0.0 38.7 1:06.98 julia
24870 alkorang 20 0 4002112 2.954g 16112 S 0.0 38.7 1:06.37 julia
24872 alkorang 20 0 4002112 2.954g 16112 S 0.0 38.7 6:52.61 julia
24873 alkorang 20 0 4002112 2.954g 16112 S 0.0 38.7 6:53.96 julia
This is because OpenBLAS does not use its thread pool when we set the number of threads to 1: each BLAS call then runs on the Julia thread that made it, so we get 4x1 = 4 threads.
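The single-threaded BLAS case can be sketched as follows. Note that threaded_products! is a hypothetical stand-in I wrote for illustration, not the manymul_threaded! function from the original post, and the matrix sizes are arbitrary.

```julia
using LinearAlgebra

# Hypothetical stand-in for a threaded benchmark loop
# (not the original manymul_threaded!).
function threaded_products!(out, As, Bs)
    # Julia threads split the iterations among themselves.
    Threads.@threads for i in eachindex(out, As, Bs)
        mul!(out[i], As[i], Bs[i])  # one BLAS matrix multiply per iteration
    end
    return out
end

# With one BLAS thread, each multiply runs on the Julia thread that calls it.
BLAS.set_num_threads(1)

As = [randn(200, 200) for _ in 1:16]
Bs = [randn(200, 200) for _ in 1:16]
out = [similar(A) for A in As]

threaded_products!(out, As, Bs)
```

Started with JULIA_NUM_THREADS=4, I would expect top -H to show at most 4 busy threads while this runs, but I have not measured this particular sketch.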