Multithreading using more CPUs than expected

I opened a Jupyter Notebook kernel with 25 threads (Threads.nthreads() prints 25 at the start). I then run some code that manages to use 8900% CPU, as reported by top in the terminal, i.e. 89 threads' worth of CPU, even though I’m only asking for 25 threads. Do people have any suggestions? For reference, I just want to run a sequence of for loops over a 5-dimensional array, and I’ve been doing something like

ThreadPools.@qthreads for i in 1:6
ThreadPools.@qthreads for j in 1:4
ThreadPools.@qthreads for k in 1:5

etc. How is the system using 8900% CPU despite only having 25 threads called in the beginning?

BLAS threads (i.e. for matrix multiplication) are separate from Julia threads.

I heard you can stop BLAS from using more threads by calling julia -p 1, for instance. If I run my code without the parallelised loops, and on julia -p 1 (supposedly shutting down the BLAS extra threads), with one process and one thread, it still uses 6400% CPU. What could be going on?

You need to use LinearAlgebra.BLAS.set_num_threads(1) to set the number of BLAS threads
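To check this in a running session, something like the following minimal sketch should do. It only uses Threads.nthreads(), BLAS.get_num_threads() and BLAS.set_num_threads(); the printed values depend on your machine and on how Julia was started.

```julia
using LinearAlgebra

# Julia threads are fixed at startup (e.g. `julia -t 25`); BLAS keeps its own thread pool.
println("Julia threads:       ", Threads.nthreads())
println("BLAS threads before: ", BLAS.get_num_threads())

# Restrict BLAS to one thread so each Julia thread issues serial BLAS calls.
BLAS.set_num_threads(1)
println("BLAS threads after:  ", BLAS.get_num_threads())
```

With BLAS pinned to one thread, total CPU usage from matrix operations should stay close to the number of Julia threads that are actually busy.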

That’s wrong: the -p flag sets the number of worker processes and has nothing to do with BLAS threads. Use the OPENBLAS_NUM_THREADS=1 environment variable or the interactive option that @jishnub suggested.

For more details, check out this recent addition to the docs: Performance tips - multithreading and linear algebra

Thanks! This worked! Wow I never knew BLAS would be calling up to 64 threads on its own.

Somehow, the problem came back, even without BLAS threads. Regardless of whether I run the program in the terminal or in Jupyter Notebook, the CPU usage keeps blowing up, even if I only ask for 25 cores. Does anyone know what could be causing this? I am only using matrix calculations in my code, with LinearAlgebra.BLAS.set_num_threads(1) set.

The kernel appears to die immediately after the @threads call, which is really strange. If I run it on Jupyter Notebook, it says ‘the kernel appears to have died’. If I run it on the terminal, it says ‘Killed’.

Could you try using the environment variable? Does that also lead to excessive usage? Could you also print out BLAS.get_num_threads() before the threaded loop?

I eventually found the problem. Turns out that I needed to call garbage collection every loop iteration, after setting the intermediate buffer vectors to nothing.

This shouldn’t be necessary in general. Could you post a minimal example that leads to this? It sounds like a Julia issue.

If this makes a major difference you’re likely allocating a lot within your multithreaded tasks. Be aware that multithreading scales poorly in such cases due to single-threaded GC (at least until Julia 1.10 drops). To avoid this, you should try to allocate intermediate buffers once per task rather than once per iteration. This blog post shows one way to do it (dropping @threads in favor of upfront chunking and @spawn): PSA: Thread-local state is no longer recommended.
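A rough sketch of that pattern, in case it helps. The function name process_chunked, the workload (a matrix-vector product followed by a norm) and the ntasks keyword are all made up for illustration and are not taken from your code; the point is only that buf is allocated once per task and reused via mul!.

```julia
using Base.Threads: @spawn, nthreads
using LinearAlgebra

# Hypothetical workload: for each column of `xs`, compute norm(A * column).
function process_chunked(A::Matrix{Float64}, xs::Matrix{Float64}; ntasks::Int = nthreads())
    n = size(xs, 2)
    results = Vector{Float64}(undef, n)
    # Split 1:n into at most `ntasks` contiguous chunks (chunk length at least 1).
    chunks = Iterators.partition(1:n, max(1, cld(n, ntasks)))
    tasks = map(chunks) do chunk
        @spawn begin
            # Intermediate buffer allocated once per task, not once per iteration.
            buf = Vector{Float64}(undef, size(A, 1))
            for i in chunk
                mul!(buf, A, view(xs, :, i))  # in-place product, reuses buf
                results[i] = norm(buf)        # each task writes to distinct indices
            end
        end
    end
    foreach(wait, tasks)
    return results
end
```

Usage would look like process_chunked(A, xs), or process_chunked(A, xs; ntasks = 25) to choose the number of tasks explicitly. Since the loop body no longer allocates, the GC should have far less to do and calling it manually should not be needed.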


On a different note, be aware that ThreadPools seems to not quite have kept up with the changes required by the dynamic schedule that @threads defaults to since Julia 1.8. But this new schedule also reduces the need for ThreadPools, so I’d suggest working without it, using just @threads and/or @spawn, and only bringing back ThreadPools once you’re sure your code works, if you’re still curious about its functionality.

(Specifically, while ThreadPools.@tspawnat was fixed, I’ve noticed other places in the code where @threads is used assuming the semantics of @threads :static. I’ve been meaning to file an issue, but haven’t gotten around to it yet.)
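As a sketch of what working with just @threads could look like for the nested loops in the question: a single threaded loop over all (i, j, k) combinations, instead of three nested threaded loops. The function name reduce_trailing and the loop body (summing over the last two dimensions) are placeholders I made up; only the loop ranges 6, 4 and 5 come from the original post.

```julia
using Base.Threads: @threads

# Hypothetical stand-in for the real computation: sum over the last two dimensions.
function reduce_trailing(A::Array{Float64,5})
    out = zeros(size(A, 1), size(A, 2), size(A, 3))
    # One flat threaded loop over every (i, j, k) index of the first three dimensions.
    @threads for I in CartesianIndices(out)
        i, j, k = Tuple(I)
        out[i, j, k] = sum(view(A, i, j, k, :, :))
    end
    return out
end

A = rand(6, 4, 5, 3, 2)  # trailing sizes 3 and 2 are arbitrary
out = reduce_trailing(A)
```

This keeps the number of concurrently running tasks bounded by the number of Julia threads, which makes CPU usage easier to reason about than nested threaded loops.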
