pmap and svdvals speed from the documentation example

I am wondering why the following example of pmap isn’t getting any speedup:

using Distributed
using BenchmarkTools

addprocs(length(Sys.cpu_info()));
@everywhere using LinearAlgebra

M = Matrix{Float64}[rand(1000,1000) for i = 1:10];
@btime pmap(svdvals, M); # 1.931 s
@btime map(svdvals, M);  # 764.653 ms

This is an example taken from the Distributed Computing section of the manual. This behavior doesn't only occur for svdvals, but also for what I actually need it for: backslash (\). Also, even if I compute the svdvals of each M[i] in a for loop, pmap still seems to be outperformed.

Increasing the size of M doesn't really seem to improve things. So I am wondering why this is happening, and whether there is a way to actually get a speedup in a situation where I need to compute the svd, or backslash, of matrices like those in M (or of the slices of an Array{Float64,3} of size (NN, N, N) along the first dimension)?
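To make the second case concrete, this is roughly the pattern I have in mind (just an untested sketch; A, NN, and N are stand-ins for my actual data, and it assumes workers have already been added and LinearAlgebra loaded on them):

using Distributed
@everywhere using LinearAlgebra

NN, N = 10, 1000                     # placeholder sizes
A = rand(NN, N, N)                   # stand-in for my Array{Float64,3}

# Materialize each N×N slice so every worker only receives one slice.
slices = [A[i, :, :] for i in 1:NN]

# Singular values of every slice, distributed over the workers.
svals = pmap(svdvals, slices)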

Most linear algebra stuff in Julia is already multithreaded.
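You can check this directly; on recent Julia versions something like the following shows how many threads BLAS is using (BLAS threading is separate from the Distributed workers):

using LinearAlgebra

# Each svdvals call already runs on this many BLAS threads,
# so adding workers on the same machine mostly oversubscribes the cores.
BLAS.get_num_threads()   # e.g. 8 on an 8-core laptop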

I understand that, but shouldn’t I see some sort of speedup if instead of

M = Matrix{Float64}[rand(1000,1000) for i = 1:10];

I increase the size to

M = Matrix{Float64}[rand(1000,1000) for i = 1:1000];

which doesn't seem to be the case. I am in a situation where I have a bunch of large linear systems to solve. To my understanding, pmap should be able to handle this, but if that isn't the case, what would be the best way of going about it? Using MPI.jl?

Are you trying this on a multi-computer cluster? If not, I wouldn’t expect it to be faster.

I have sequential code running on a cluster, but I have not implemented the parallelization yet, as I am still trying to figure out an (intelligent) way of doing it; I have just been experimenting on my laptop. I should mention that part of the job of each process will also be building the matrices; they are not random as in the example above.
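Roughly, each task would look something like this (just a sketch; build_system is a made-up stand-in for my actual matrix assembly):

using Distributed

@everywhere begin
    using LinearAlgebra

    # Stand-in for the real assembly routine: builds the i-th system.
    function build_system(i, N)
        A = rand(N, N) + N * I        # placeholder matrix
        b = rand(N)                   # placeholder right-hand side
        return A, b
    end

    # Build and solve the i-th system entirely on the worker.
    function solve_system(i, N)
        A, b = build_system(i, N)
        return A \ b
    end
end

solutions = pmap(i -> solve_system(i, 1000), 1:100);

The idea being that the matrix is built on the worker that solves it, so only the index and the solution travel over the network.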

But you expect that on a cluster I should be able to see some speedup?

For testing on your laptop, set BLAS threads to 1 using BLAS.set_num_threads(1). That said, if you are trying to optimize performance for a cluster, testing on a laptop will not give useful results.
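Something along these lines; to my understanding each worker process has its own BLAS, so the setting needs @everywhere to reach them as well:

using Distributed
addprocs(length(Sys.cpu_info()));      # one worker per core

@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)    # disable BLAS threading on every process

using BenchmarkTools
M = Matrix{Float64}[rand(1000,1000) for i = 1:10];
@btime map(svdvals, $M);
@btime pmap(svdvals, $M);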

I already tried BLAS.set_num_threads(1), but the speeds are still comparable, which is another reason I am a little confused by this example from the manual. But it seems like I should switch to testing on the cluster, as you suggest.