Slowdown when computing eigenvalues for list of matrices with pmap

I am attempting to compute, in parallel, the eigenvalues of a list of symmetric matrices. In the code below this is done with pmap (using 10 workers) and sequentially (for comparison). I've set the number of BLAS threads used by eigvals to 2, to match the threads available to a single core. For smaller matrix sizes the speed-up from pmap is roughly 5-8x, but for the larger sizes tested here the advantage over the sequential calculation basically vanishes.

What is the bottleneck for this (admittedly naïve) approach, and is there a more appropriate way to proceed with such a parallelization?

Note: This is run with OpenBLAS on Julia 1.7.0-beta3.

using Distributed, Distributions

addprocs(10)

@everywhere using LinearAlgebra

# Limit BLAS to 2 threads on every process (master and workers).
@everywhere BLAS.set_num_threads(2)

for m in [10, 50, 100, 200, 500, 1000, 1500, 2000]

    A = rand(Uniform(0., 1.), m, m)
    symA = Symmetric(A)
    mats = repeat([symA], length(workers()))

    print(m)

    # Parallel: one eigvals call per worker.
    @time pmap(eigvals, mats)

    # Sequential baseline on the master process.
    @time begin
        for mat in mats
            eigvals(mat)
        end
    end
end

50  0.938094 seconds (1.18 k allocations: 452.578 KiB)
  0.011508 seconds (220 allocations: 774.062 KiB)
100  0.002881 seconds (1.25 k allocations: 76.531 KiB)
  0.033060 seconds (3.46 k allocations: 2.457 MiB, 22.69% gc time, 9.06% compilation time)
200  0.021528 seconds (1.35 k allocations: 92.062 KiB)
  0.045115 seconds (220 allocations: 7.556 MiB)
500  0.032562 seconds (1.25 k allocations: 137.609 KiB)
  0.249784 seconds (220 allocations: 41.750 MiB, 2.44% gc time)
1000  0.209816 seconds (1.27 k allocations: 214.516 KiB)
  1.146097 seconds (220 allocations: 159.775 MiB, 1.18% gc time)
1500  1.519743 seconds (1.27 k allocations: 294.281 KiB)
  3.110555 seconds (240 allocations: 354.096 MiB, 2.67% gc time)
2000  5.861284 seconds (1.27 k allocations: 368.609 KiB)
  6.673908 seconds (240 allocations: 624.709 MiB, 0.65% gc time)

Intuitively, I’d say it’s because the matrices have to be transferred to the worker processes, which takes time.
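One way to check this is to take the transfer out of the measurement entirely, by building each matrix on the worker and timing only eigvals there. The sketch below is just an illustration of that idea; the 2000×2000 size and the use of myid() to seed the RNG are arbitrary choices, not part of the original code.

@everywhere using LinearAlgebra, Random

# Construct the matrix locally on each worker and time only the eigenvalue
# computation, so nothing large is serialized from the master process.
compute_times = pmap(workers()) do _
    A = Symmetric(rand(MersenneTwister(myid()), 2000, 2000))  # built on the worker
    @elapsed eigvals(A)
end

If these per-worker timings are much smaller than the overall pmap wall time, the difference is dominated by sending the matrices (and returning the results) rather than by the eigenvalue computation itself.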

Does this mean that each core will use hyperthreads? If so, you're unlikely to see any benefit from using two threads per core, and might actually see slowdowns. One thread per physical core is often better for performance than relying on hyperthreads.
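If the 10 workers map onto 10 physical cores with hyperthreading, one option is to pin BLAS to a single thread on every process so the worker processes don't oversubscribe the cores. A minimal sketch, assuming at least 10 physical cores are available:

@everywhere using LinearAlgebra

# One BLAS thread per worker process: 10 workers x 1 thread = 10 busy cores.
@everywhere BLAS.set_num_threads(1)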
