BLAS performance testing for Julia 1.8

Is this documented somewhere?

What does it mean? That if I call Distributed.addprocs(n), the number of BLAS threads (for each of the distributed procs) automatically gets set to 1?

As far as I know this is the standard (default) setting on Julia 1.7.1 and recent earlier versions. As for 1.8, the last time I checked (Version 1.8.0-DEV.1177 (2021-12-25)) I was constantly hitting the default OS limit (1024 processes) when starting Julia with 1 thread on a 160-core (2 x 80) machine.

I'm not sure I quite understand. We now allow OpenBLAS to pick the number of threads, but that shouldn't run into the process limit. Can you file an issue describing exactly what is happening?

-viral

I don't think it is. Should probably be documented in the section on distributed computing in the manual.

The default OS process limit is 1024?
To explain, I work in HPC and it is common to increase limits for processes and pinned memory to large values or maybe unlimited.

@viralbshah
I'm sorry for the slight delay in replying. I just did some additional testing. In the documentation for 1.7 [Distributed Computing · The Julia Language] it is written: "enable_threaded_blas: if true then BLAS will run on multiple threads in added processes. Default is false". However, when I start Julia with 1 thread and then try to add 128 procs (this is on a machine with 64 cores / 128 threads), I hit my default OS limit (RLIMIT_NPROC 1024).

So it seems that Julia 1.7 is defaulting to 8 BLAS threads per process. As I currently understand it, this might contradict what is written in the documentation (I am not sure about it, hence the question)?

On 1.8.0-beta1 it seems that when adding new workers, each process defaults to the number of logical cores on the machine, and I am also hitting the limit. In order to start Julia on this machine with 64/128 procs I have to set the number of OpenBLAS threads with export OMP_NUM_THREADS=1, or start Julia with OMP_NUM_THREADS=1. I am not a pro developer, so I am sorry for the question: is this right, and is it the recommended way?
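
For reference, here is a minimal sketch of how the thread count can be checked and changed from inside a running session, using BLAS.get_num_threads and BLAS.set_num_threads from the LinearAlgebra standard library. As far as I understand, this only changes the setting after the BLAS library has already been loaded, so it may not avoid the process limit at startup; for that the environment variable still seems to be needed.

```julia
using LinearAlgebra

# Inspect what the BLAS library picked when it was loaded.
println("BLAS threads at startup: ", BLAS.get_num_threads())

# Restrict this process to a single BLAS thread.
BLAS.set_num_threads(1)
println("BLAS threads now: ", BLAS.get_num_threads())
```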

@johnh

The default OS process limit is 1024?

Yes, it is on this OS.

To explain, I work in HPC and it is common to increase limits for processes and pinned memory to large values or maybe unlimited.

Yes, I am aware of it. I have read that the way to change the default values differs between Linux versions, right? I have never tried to change it. Is sudo needed?

However, I have to admit that I am not sure I'd like to increase the limit. My use case is the connect-four toy problem example of AlphaZero.jl (GitHub - jonathan-laurent/AlphaZero.jl: A generic, simple and fast implementation of Deepmind's AlphaZero algorithm.). The machines I have access to have a rather large amount of RAM (256GB to 1TB), but I guess that it might not be enough for this particular case (connect-four with more than 1024 processes).

What I was wondering is whether it is possible to configure Julia so that the first worker (on this particular machine) runs with 128 BLAS threads and the other 63 or 127 workers run with only 1 BLAS thread each?
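
To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the env keyword of addprocs can be used to pass OPENBLAS_NUM_THREADS to local workers, so that every worker starts single-threaded and only one of them is raised afterwards. The worker count (2) and the raised thread count (4) are small placeholders rather than the 128 from my machine, and I have not verified how OpenBLAS behaves when the count is raised after startup.

```julia
using Distributed, LinearAlgebra

# Launch workers that start with a single OpenBLAS thread each.
# 2 workers is a placeholder; scale this up for the real machine.
new_workers = addprocs(2; env = ["OPENBLAS_NUM_THREADS" => "1"])
@everywhere using LinearAlgebra

# Raise the BLAS thread count on the first worker only (4 is a placeholder).
first_worker = first(new_workers)
remotecall_wait(BLAS.set_num_threads, first_worker, 4)

# Report what each worker ended up with.
for w in new_workers
    println("worker ", w, ": ", remotecall_fetch(BLAS.get_num_threads, w), " BLAS thread(s)")
end
```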

-j_u

@johnh Thanks so much for the suggestion. Should you need any additional information, please let me know. @viral Thanks so much for the extended BLAS capabilities. I did some additional thinking, particularly in relation to AlphaZero.jl, and I will do some additional testing ASAP. Should I have any further questions, I'll post them here.

Hey, can I ask what the current status of the BLAS settings for Julia 1.8 is? I just took a look at [https://github.com/JuliaLang/julia/blob/v1.8.0-rc3/NEWS.md#linearalgebra=] and there seems to be no info on this topic, hence my question here.

This was implemented in "set default blas num threads to Sys.CPU_THREADS / 2 in absence of OPENBLAS_NUM_THREADS" by Moelf · Pull Request #45412 · JuliaLang/julia · GitHub and is backported to 1.8. Good idea that it should have a mention in NEWS. I have added the needs news label. PR to NEWS file welcome!

Thanks a lot for this info! I agree, I believe such information could be useful for many users, including, as I understand it, users of processors without HT (e.g. Neoverse, Graviton). I am always happy to help; however, I believe the documentation is best handled by the authors or … the bosses [this is of course said with a smile, but seriously, I see that there is still a discussion on this topic taking place].

I am not sure it is a good default choice.
Many users disable HT, and in the current generation (Gen 12) of Intel's CPUs there are more cores without HT than with it.

The optimal solution would be to infer this in code.
I guess that is not trivial, hence a simpler predefined default was chosen. Yet I think a much more reasonable choice would be the number of threads the OS reports. For those with HT it will cost some performance, but that will be negligible compared to the performance hit for those who have disabled HT (half the performance) or who have the 12th and the coming 13th generation of Intel's CPUs (on the coming 13900, which is 8 + 16, it means running 16 threads, which might mean 8 cores are not used).
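
To spell out the arithmetic, here is a small sketch. The helper half_rule is hypothetical; it only mirrors the rule named in the PR title (logical threads divided by 2) and is not the actual Julia source. The core counts in the table are illustrative configurations, with the last row matching the 8 + 16 layout mentioned above.

```julia
# Hypothetical helper: "half of the logical threads", with a floor of 1.
half_rule(logical_threads) = max(1, div(logical_threads, 2))

# (description, physical cores, logical threads reported by the OS)
machines = [
    ("8 cores, HT enabled", 8, 16),
    ("8 cores, HT disabled", 8, 8),
    ("8 cores with HT + 16 cores without HT", 24, 32),
]

for (name, cores, logical) in machines
    n = half_rule(logical)
    println(name, ": ", n, " BLAS threads for ", cores, " physical cores")
end
```

With HT enabled the rule lands on one thread per physical core, with HT disabled it uses only half of the cores, and on the hybrid layout it gives 16 threads for 24 cores.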
