How to prevent BLAS from thrashing with Julia?

Hello.

Let’s say I’m trying to run the following function many times on multiple cores:

@everywhere function test()
  X = randn(800, 800)
  Y = randn(800, 800)
  Base.LinAlg.BLAS.axpy!(2.0, X, Y)
end

(The real function is vastly more complicated but also dominated by a BLAS call).

If I start Julia with some number of worker processes and run

julia> pmap(x -> test(), 1:length(workers()))

it appears to me from the CPU scaling that pmap is contending with the threads BLAS is using to run axpy!.

Even if I start up Julia with a single worker process, my eight-threaded Intel Core i7 appears to show 4 threads being used. This is also true after running

julia> BLAS.set_num_threads(1)

How do I spawn worker processes that won’t be competing with BLAS for resources?

Try setting the environment variable OPENBLAS_NUM_THREADS=1 before launching Julia.

I’m surprised BLAS.set_num_threads(1) didn’t fix it. Could you post a minimal example of this case?

So here’s a REPL session followed by a screenshot of CPU activity while the REPL was churning on the last statement. Julia v0.5.2, started with one process (which agrees with Activity Monitor). There were other processes running, ofc, but I don’t think that’s what I’m seeing in the image.

I’m not sure how to check what version of BLAS Julia is using, so that’s part of my problem, too.

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.2 (2017-05-06 16:34 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-apple-darwin13.4.0

julia> BLAS.set_num_threads(1)

julia> A = rand(8000, 8000);

julia> b = rand(8000);

julia> c = zeros(8000);

julia> for i = 1:100; c += A \ b; end;

Paradoxically (imo), setting OPENBLAS_NUM_THREADS=1, reopening Julia, and using the same REPL commands makes BLAS use the same four threads, but more fully.

If someone could explain what is going on here, I’d greatly appreciate it. I’m very confused.

I ran your example on Linux (unfortunately I don’t have a Mac to test on), and without specifying the number of threads BLAS should use, top reported a steady 400% CPU usage (my machine has 4 logical CPUs). With BLAS.set_num_threads(1), it was steady at 100%. Could you try using top to measure CPU usage and report the numerical value? I’m wondering if this is a problem with the measurement and not with the actual CPU usage.

versioninfo() should show which BLAS library is used. (Your banner says “Official release” so it’s presumably OpenBLAS.) It should also show your processor model, and unless you have tweaked your system I expect it will confirm that you have 4 physical cores with hyperthreading.

You don’t say how you set OPENBLAS_NUM_THREADS, but it looks like it didn’t take. You can check that by displaying ENV["OPENBLAS_NUM_THREADS"] in Julia. If it’s not set, OpenBLAS defaults to the number of physical cores on macOS, which would explain your second chart. (This differs from Linux.) Note that CPU usage of 50% may mean full utilization, since the extra virtual cores don’t have separate floating point units.
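A quick way to run those checks from the REPL, as a rough sketch. It assumes Julia 1.6 or later, where versioninfo lives in InteractiveUtils and BLAS.get_num_threads exists; on 0.5 only the ENV lookup and versioninfo() apply, and neither needs the using line. Using get avoids the KeyError that ENV["OPENBLAS_NUM_THREADS"] throws when the variable is unset.

```julia
using LinearAlgebra, InteractiveUtils

# Look up the variable without throwing if it was never exported.
openblas_threads = get(ENV, "OPENBLAS_NUM_THREADS", "<not set>")
println("OPENBLAS_NUM_THREADS = ", openblas_threads)

# Thread count the loaded BLAS is currently configured to use (Julia >= 1.6).
println("BLAS threads: ", BLAS.get_num_threads())

# Logical CPUs seen by Julia; with hyperthreading this is usually
# twice the number of physical cores.
println("Logical CPUs: ", Sys.CPU_THREADS)

# Prints the Julia version, platform, and CPU model.
versioninfo()
```

If the first line prints "<not set>", the variable never reached the Julia process, for example because it was set in a different shell or not exported.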

Your first chart [with BLAS.set_num_threads(1)] seems to show the scheduler migrating Julia tasks between physical processors. I think this depends on the macOS version and platform (some systems strive to balance the load on physical processors).

Ah, I’d forgotten that most operating systems try to do this. Thanks, I think that solves my problem. Sounds like BLAS.set_num_threads is working as intended.
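
For anyone landing here with the original pmap question, here is a minimal sketch of pinning BLAS to one thread on every process so the workers don't oversubscribe the cores. It assumes Julia 1.x, where Distributed and LinearAlgebra are standard libraries; on 0.5 the same functions come from Base and the using lines are not needed. The worker count of 2 is an arbitrary value chosen for illustration.

```julia
using Distributed, LinearAlgebra

addprocs(2)  # hypothetical worker count; pick one per physical core

# Load LinearAlgebra on every process, then restrict BLAS to a single
# thread on each of them (master included).
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)

# Same toy workload as in the first post: Y = 2.0 * X + Y via BLAS.
@everywhere function test()
    X = randn(800, 800)
    Y = randn(800, 800)
    BLAS.axpy!(2.0, X, Y)
end

# One call per worker; each returns its updated Y.
results = pmap(x -> test(), 1:nworkers())
```

BLAS.set_num_threads only affects the process it runs in, which is why the call goes through @everywhere instead of being issued once on the master.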