When I call one of the low-level LinearAlgebra.BLAS functions inside a function that pmap runs on a worker process, BLAS does not appear to use multiple threads. Here is a simple example:
using Distributed
@everywhere using LinearAlgebra
@everywhere function test(K)
    M = rand(K, K)
    out = BLAS.gemm('T', 'N', M, M)
    println("Done!")
end
pmap(test, [4096, 4096, 4096]);
Running this in a single Julia process (e.g. just julia on the command line) and watching top shows Julia using about 750% CPU, which is what I would expect given that BLAS defaults to 8 threads. However, running with julia -p 3 shows that none of the three worker processes ever gets above 100% CPU. Is this expected behavior? If so, are there steps I can take to let each worker process use a multi-threaded BLAS? I’m using a machine with dozens of cores, so I don’t expect contention among the processes. I’m a Julia newbie, so apologies if there’s a simple explanation for this.
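To make the question concrete, here is a minimal sketch of how the BLAS thread count could be inspected on every process and then raised explicitly. It assumes Julia 1.6 or later (for BLAS.get_num_threads), and the value 8 is only a guess taken from the single-process default I observed; I have not confirmed that this is the recommended way to do it.

```julia
using Distributed

# Assumption: start three workers here if the session was not launched with `julia -p 3`.
nprocs() == 1 && addprocs(3)

@everywhere using LinearAlgebra

# Report the BLAS thread count that each process (master and workers) currently sees.
for p in procs()
    n = remotecall_fetch(BLAS.get_num_threads, p)
    println("process $p: $n BLAS thread(s)")
end

# Request 8 BLAS threads on every process; the library may cap this
# at whatever the machine or the BLAS build supports.
@everywhere BLAS.set_num_threads(8)

# Read the setting back from the workers to see what actually took effect.
counts = [remotecall_fetch(BLAS.get_num_threads, p) for p in workers()]
println("worker BLAS thread counts after set_num_threads: ", counts)
```

If the workers report 1 thread before the set_num_threads call, that would match the 100% CPU ceiling seen in top. The addprocs documentation also describes an enable_threaded_blas keyword, which might be an alternative when workers are added from inside the session rather than with -p.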