When I call one of the low-level `LinearAlgebra.BLAS` functions inside a function that `pmap` executes on a worker process, it appears that BLAS does not use multiple threads. Here is a simple example:
```julia
using Distributed
@everywhere using LinearAlgebra

@everywhere function test(K)
    M = rand(K, K)
    # Compute M' * M with a direct BLAS call
    out = BLAS.gemm('T', 'N', M, M)
    println("Done!")
end

pmap(test, [4096, 4096, 4096]);
```
Running this in a single Julia process (e.g. just `julia` on the command line) and watching `top` shows Julia using about 750% CPU, which is what I would expect given that BLAS defaults to 8 threads. However, running with `julia -p 3` shows that none of the three worker processes ever gets above 100% CPU.

Is this expected behavior? If so, are there steps I can take to let each worker process use a multi-threaded BLAS? I’m using a machine with dozens of cores, so I don’t expect contention among the processes. I’m a Julia newbie, so apologies if there’s a simple explanation for this.
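
To narrow this down, here is a minimal diagnostic sketch that asks the master process and each worker how many threads BLAS reports. It assumes `BLAS.get_num_threads()` is available (Julia 1.6 or later); the variable name `blas_threads` is only for illustration.

```julia
using Distributed
@everywhere using LinearAlgebra

# Ask every process (master and workers) how many threads BLAS reports.
# Assumption: BLAS.get_num_threads() exists, i.e. Julia 1.6 or later.
blas_threads = Dict(p => remotecall_fetch(() -> BLAS.get_num_threads(), p)
                    for p in procs())

# Print the results in process-id order.
for p in sort(collect(keys(blas_threads)))
    println("process ", p, ": ", blas_threads[p], " BLAS thread(s)")
end
```

If the workers report a single thread, calling `@everywhere BLAS.set_num_threads(8)` before the `pmap` call may be worth trying. Whether that is the recommended approach is something I would still like confirmed.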