Bitten by BLAS threading. Again. (Interaction w/ MPI.)

I ran some MPI codes on an M2 Ultra Mac (16 performance cores) and on a Grace+Grace system (two sockets, 72 cores each).

The runs used 4 processes on a single machine (node). There was a strange mismatch in the performance of the individual processes, even though the workload was balanced rather well. Tracking it down showed that BLAS in each process started as many threads as there are cores on the machine, so with 4 processes the node was oversubscribed several times over, and the cores were presumably not shared equitably. No wonder there was hardly anything left to actually run most of the processes.

Manually setting the number of BLAS threads to a reasonable value fixed it, and the program then ran well.
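
In Julia this amounts to one call per rank at startup (a minimal sketch; the value 4 below is just an example, i.e. the cores available on the node divided by the number of MPI ranks):

```julia
using LinearAlgebra

# Give each MPI rank its share of the cores instead of letting BLAS
# grab all of them, e.g. 4 BLAS threads per rank for 4 ranks on a
# 16-core node.
BLAS.set_num_threads(4)
```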

The same program was tested at the same time on Linux machines (Intel Xeon and AMD Opteron); the BLAS problem did not show up there.

I think the lesson might be: always be aware of multithreading in the libraries your code uses.

3 Likes

Coming from an HPC background, I can’t understand how automatic threading was ever adopted as the default. It’s one of the truly bad decisions that were made for Julia (although unlike some other warts, this one could be fixed, and might not even be considered a breaking change).

My recommendation is for people to define environment variables in their .bashrc or equivalent file to disable all automatic threading.
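
For example, something along these lines (a sketch; the exact variable names depend on which BLAS/OpenMP backend is actually loaded, these are just the usual suspects), and you can check from within Julia what actually took effect:

```julia
# Typical variables to put in .bashrc (names depend on the backend in use;
# these are the common ones):
#   OPENBLAS_NUM_THREADS=1    # OpenBLAS, Julia's default BLAS
#   OMP_NUM_THREADS=1         # OpenMP-based libraries (MKL, FFTW, ...)
#   VECLIB_MAXIMUM_THREADS=1  # Apple Accelerate
# Then verify from inside Julia:
using LinearAlgebra
@show BLAS.get_num_threads()
@show Threads.nthreads()  # Julia's own task threads, controlled separately (-t / JULIA_NUM_THREADS)
```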

3 Likes

I seem to recall that it was debated for precisely this reason, though I can’t find those debates right now: it’s not composable to assume that your process can take over the entire machine. On the other hand, if we don’t default to a multi-threaded BLAS, we get a never-ending stream of complaints from users that we are slower than MATLAB and NumPy. I think multicore by default was chosen because HPC users tend to be sophisticated enough to figure out how to disable threading, whereas enabling it would be a big hurdle for the common case of an ordinary user (who is not used to thinking about parallelism) running a single Julia process.

At least we have an easy way to turn off multithreading in Julia. On the Python side, we’ve had difficulty using jax in MPI projects because each process wants to grab all the cores and there are only partially functional hacks to disable this (Limit jax multithreading · Issue #743 · google/jax · GitHub).

Note that there is some possibility of improving this further to avoid oversubscribing by default in typical HPC scenarios where process affinity is set by the queueing system: [LinearAlgebra] Initialise number of BLAS threads with `uv_available_parallelism` by giordano · Pull Request #55574 · JuliaLang/julia · GitHub
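
For the curious, you can already query what that would report on a given system (a small sketch; it assumes a Julia recent enough that the bundled libuv exposes `uv_available_parallelism`):

```julia
# Compare the total hardware thread count with what the current
# affinity mask / cpuset actually allows this process to use.
total = Sys.CPU_THREADS
avail = Int(ccall(:uv_available_parallelism, Cuint, ()))
println("hardware threads = $total, available to this process = $avail")
```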

6 Likes