What is the correct way to use MKL.jl on Julia Version 1.7.0-beta3.0 (2021-07-07) with respect to the number of distributed processes and the number of threads?
I do reinforcement learning (AlphaZero.jl) training on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz (6 physical cores per socket, 2 sockets, i.e. 12 physical and 24 logical cores).
I was able to use MKL.jl without any problems when starting `julia -t 12` and not using distributed processing; however, the calculations were quite slow. In `top` I saw `%Cpu(s)` at about 54.8 us (user time), the Julia process at about 2144% CPU, and RAM utilization was very low compared to some of my fastest distributed runs.
When experimenting with Distributed and addprocs(12), I encountered several problems, mostly of this kind:
From worker 8: OMP: Error #34: System unable to allocate necessary resources for OMP thread:
From worker 8: OMP: System error #11: Resource temporarily unavailable
From worker 8: OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
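For context, the workaround I am considering (just a sketch based on my reading of the Distributed docs, not yet verified on this machine) is to cap the OpenMP thread count when spawning the workers, so MKL's OpenMP runtime does not oversubscribe the cores:

```julia
using Distributed

# Spawn 12 local workers, each with OMP_NUM_THREADS capped at 1,
# so MKL's OpenMP runtime does not oversubscribe the 24 logical cores.
# (The `env` keyword of addprocs may require a recent Julia version.)
addprocs(12; env=["OMP_NUM_THREADS" => "1"])

# Load MKL and LinearAlgebra on every worker, not just the master.
@everywhere using MKL
@everywhere using LinearAlgebra

# One BLAS thread per worker; parallelism then comes from the workers.
@everywhere BLAS.set_num_threads(1)
```

I do not know whether this is the intended usage, which is part of my question.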
Thus I would like to ask for advice.
How should I start julia?
- `julia -t 12` or
- `julia -t 24` or
- `julia -p 12 -t 12` or
- `julia -p 24 -t 24`
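To clarify what I think these options mean (this is my understanding of the flags, please correct me if wrong, assuming 12 physical cores):

```
# -t N : N Julia threads in the (main) process
# -p N : N additional worker processes at startup
julia -t 12          # one process, 12 threads (multithreading only)
julia -p 12 -t 1     # 12 workers, 1 thread each (distributed only)
julia -p 12 -t 12    # 12 workers AND 12 threads each
                     # (unclear to me whether -t propagates to the
                     #  workers started via -p, hence this question)
```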
How many distributed processes should I add?
- `using Distributed; addprocs(12)`
- `using Distributed; addprocs(24)`
How should I be using MKL?
- `using MKL`
- `@everywhere using MKL`
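For completeness, this is how I have been checking which BLAS backend is actually active (on Julia 1.7, LinearAlgebra exposes the libblastrampoline configuration):

```julia
using MKL
using LinearAlgebra

# On Julia 1.7 this should list the MKL library (libmkl_rt)
# if `using MKL` took effect:
BLAS.get_config()

# Current number of BLAS threads:
BLAS.get_num_threads()
```

I am unsure whether this check also reflects what the *workers* are using, or only the master process.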
How many BLAS threads should I be setting?
- `BLAS.set_num_threads(24)` [AFAIK MKL operates on physical cores, so 24 is probably incorrect]
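My working assumption (please correct me if this is wrong) is that the product of worker count and BLAS threads per worker should not exceed the core count, e.g. on 12 physical cores:

```julia
using Distributed
addprocs(12)
@everywhere using LinearAlgebra

# 12 workers x 1 BLAS thread = 12 threads total, matching the
# 12 physical cores; setting 24 here instead would give
# 12 x 24 = 288 threads and heavy oversubscription.
@everywhere BLAS.set_num_threads(1)
```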
How should I be using LinearAlgebra?
- `using LinearAlgebra`
- `@everywhere using LinearAlgebra`
I would like to take maximum advantage of all of the FLOPS this machine offers. Any advice, information, or even hints on this topic would be appreciated. Below I am enclosing BLASBenchmarksCPU.jl charts in case they are of any use [julia started as `julia -t 24`].