Bitten by BLAS threading. Again. (Interaction w/ MPI.)

I ran some MPI codes on an M2 Ultra Mac (16 performance cores) and on a Grace+Grace system (two sockets, 72 cores each).

The runs used 4 processes on a single machine (node). There was a strange mismatch in the performance of the individual processes, even though the workload was balanced rather well. Tracking it down showed that BLAS in each process started as many threads as there are cores on the machine, so with 4 processes the node was oversubscribed several times over, and the cores were presumably not shared equitably. No wonder there was hardly anything left to actually run most of the processes.

Manually setting the number of BLAS threads to a reasonable value fixed it, and the program then ran well.
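
In Julia this amounts to one call per rank at startup (a minimal sketch; the value 4 below is just an example, i.e. the cores available on the node divided by the number of MPI ranks):

```julia
using LinearAlgebra

# Give each MPI rank its share of the cores instead of letting BLAS
# grab all of them, e.g. 4 BLAS threads per rank for 4 ranks on a
# 16-core node.
BLAS.set_num_threads(4)
```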

The same program was tested at the same time on Linux machines (Intel Xeon and AMD Opteron); the BLAS problem did not show up there.

I think the lesson might be: always be aware of multithreading in the libraries your code uses.

3 Likes

Coming from an HPC background, I can’t understand how automatic threading was ever adopted as the default. It’s one of the truly bad decisions that were made for Julia (although unlike some other warts, this one could be fixed, and might not even be considered a breaking change).

My recommendation is for people to define environment variables in their .bashrc or equivalent file to disable all automatic threading.
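
For example, something along these lines (a sketch; the exact variable names depend on which BLAS/OpenMP backend is actually loaded, these are just the usual suspects), and you can check from within Julia what actually took effect:

```julia
# Typical variables to put in .bashrc (names depend on the backend in use;
# these are the common ones):
#   OPENBLAS_NUM_THREADS=1    # OpenBLAS, Julia's default BLAS
#   OMP_NUM_THREADS=1         # OpenMP-based libraries (MKL, FFTW, ...)
#   VECLIB_MAXIMUM_THREADS=1  # Apple Accelerate
# Then verify from inside Julia:
using LinearAlgebra
@show BLAS.get_num_threads()
@show Threads.nthreads()  # Julia's own task threads, controlled separately (-t / JULIA_NUM_THREADS)
```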

3 Likes

I seem to recall that it was debated for precisely this reason, though I can’t find those debates right now: it’s not composable to assume that your process can take over the entire machine. On the other hand, if we don’t default to a multi-threaded BLAS, we get a never-ending stream of complaints from users that we are slower than MATLAB and NumPy. I think multicore by default was chosen because HPC users tend to be sophisticated enough to figure out how to disable threading, whereas enabling it would be a big hurdle for the common case of an ordinary user (who is not used to thinking about parallelism) running a single Julia process.

At least we have an easy way to turn off multithreading in Julia. On the Python side, we’ve had difficulty using jax in MPI projects because each process wants to grab all the cores and there are only partially functional hacks to disable this (Limit jax multithreading · Issue #743 · google/jax · GitHub).

Note that there is some possibility of improving this further to avoid oversubscribing by default in typical HPC scenarios where process affinity is set by the queueing system: [LinearAlgebra] Initialise number of BLAS threads with `uv_available_parallelism` by giordano · Pull Request #55574 · JuliaLang/julia · GitHub
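
For the curious, you can already query what that would report on a given system (a small sketch; it assumes a Julia recent enough that the bundled libuv exposes `uv_available_parallelism`):

```julia
# Compare the total hardware thread count with what the current
# affinity mask / cpuset actually allows this process to use.
total = Sys.CPU_THREADS
avail = Int(ccall(:uv_available_parallelism, Cuint, ()))
println("hardware threads = $total, available to this process = $avail")
```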

6 Likes