Pmap and multi-threaded BLAS

When I call one of the low-level LinearAlgebra.BLAS functions inside a function executed on a worker process via pmap, BLAS does not appear to use multiple threads. Here is a simple example:

using Distributed

@everywhere using LinearAlgebra
@everywhere function test(K)
    M = rand(K, K)
    # Compute M' * M with a direct BLAS call
    return BLAS.gemm('T', 'N', M, M)
end

pmap(test, [4096, 4096, 4096]);

Running with a single Julia process (e.g. just running julia on the command line) and watching top shows Julia using about 750% CPU, which is what I would expect given that BLAS defaults to 8 threads. However, running with julia -p 3 shows that none of the three Julia worker processes ever gets above 100% CPU. Is this expected behavior? If so, are there steps I can take to allow each worker process to use a multi-threaded BLAS? I’m using a machine with dozens of cores, so I don’t expect contention among the processes. I’m a Julia newbie, so apologies if there’s a simple explanation for this.

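Yes, this is expected: by default, worker processes are started with BLAS restricted to a single thread. Rather than starting with julia -p 3, start with one process and invoke addprocs(3, enable_threaded_blas=true).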
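
For reference, here is a minimal sketch of how the full script might look with that change. It assumes Julia 1.6 or later (for BLAS.get_num_threads) and a plain `julia` launch without `-p`. The thread-count check is an addition for illustration and was not part of the original example.

```julia
using Distributed

# Start three local workers with multi-threaded BLAS enabled
# (instead of launching with `julia -p 3`).
addprocs(3; enable_threaded_blas=true)

@everywhere using LinearAlgebra

@everywhere function test(K)
    M = rand(K, K)
    # Compute M' * M with a direct BLAS call
    return BLAS.gemm('T', 'N', M, M)
end

# Report the BLAS thread count seen by each worker.
# Assumption: BLAS.get_num_threads is available (Julia 1.6+).
for w in workers()
    n = remotecall_fetch(BLAS.get_num_threads, w)
    println("worker ", w, ": ", n, " BLAS threads")
end

results = pmap(test, [4096, 4096, 4096])
```

If the workers have already been launched with -p, running `@everywhere BLAS.set_num_threads(8)` should raise the thread count on them as well, though I have not verified that on every Julia version. The value 8 here just mirrors the default mentioned in the question.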


Thanks - works perfectly!