Poor Distributed performance for independent linear algebra operations

I’m trying to run some embarrassingly parallel code using Distributed, but I’m seeing poor scaling as I increase the number of cores. The actual code is rather complicated, but for the most part it just involves a lot of linear algebra: matrix multiplication, determinants, and similar operations on relatively small (~50x50) matrices. I’ve come up with an MWE with some similar operations that mostly captures the scaling behavior (although the actual code performs slightly worse). The code runs 8 copies of a function that does a bunch of matrix multiplications, and this work is spread across n cores (n = 1, 2, 4, 8).

using Distributed
using BenchmarkTools
ENV["OPENBLAS_NUM_THREADS"] = "1"

@everywhere function f(i::Int)
    for i in 1:100000
        A = rand(30, 30)
        A = A * A
    end
    return
end
@btime res = pmap(f, 1:8)

I run this on an AMD Ryzen 7 3700X 8-Core Processor (16 threads, 8 physical cores) using -p n (n = 1, 2, 4, 8) and get the following output:

  3.774 s (339 allocations: 13.44 KiB)
  2.239 s (344 allocations: 13.94 KiB)
  1.888 s (354 allocations: 14.94 KiB)
  1.887 s (374 allocations: 16.94 KiB)

which only constitutes a ~2x speedup for 8 cores, as well as an unusual lack of speedup from 4 to 8 cores. I get roughly similar behavior running on an HPC node with an Intel Xeon Gold 6148 processor. I understand that perfect scaling should not be expected due to shared resources like RAM, cache, etc., but I am wondering if there is any way to improve this given the particular computations of interest. I don’t think this is solely the fault of Distributed, as I observe qualitatively similar slowdowns if I instead run multiple single-process Julia instances side by side through GNU parallel.


30 x 30 seems like a very small problem. Do your results change if you increase the size of the matrix? Does rand thread well in that loop? What happens if you replace rand with ones?

Thanks for the suggestions. Changing rand to ones does not appear to change anything. Cranking the code up to run 1000 multiplications of 300x300 matrices gives the following performance (for 1, 2, 4, and 8 cores):

  9.603 s (339 allocations: 13.44 KiB)
  4.895 s (344 allocations: 13.94 KiB)
  2.727 s (354 allocations: 14.94 KiB)
  2.057 s (374 allocations: 16.94 KiB)

So the scaling does appear to improve with larger matrices (around a 5x speedup for 8 cores). Unfortunately, the code I’m working with really does involve large numbers of small matrix multiplications, so I’m hoping to find some way of improving the scaling in that regime.

Does anything change if you use BLAS.set_num_threads(1)? I haven’t seen ENV["OPENBLAS_NUM_THREADS"] = "1" before as a way to set the number of threads to 1, and I’m not fully convinced that it does anything.

You also might want to try using MKL which won’t help parallelism, but will likely speed things up.
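If you want to try that, here is a minimal sketch (assuming the MKL.jl package is installed; on Julia 1.7+ loading it swaps the BLAS/LAPACK backend via libblastrampoline):

using MKL            # provided by MKL.jl; switches the BLAS/LAPACK backend to Intel MKL
using LinearAlgebra
BLAS.get_config()    # should now list MKL among the loaded BLAS libraries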

No, I tried BLAS.set_num_threads(1) and nothing changes (note that I can only seem to set this if I add using LinearAlgebra, otherwise I get an error that BLAS isn’t recognized; I’m a bit confused by this, since the examples I’ve found online don’t seem to mention this dependency, but it probably isn’t significant).

Could it have to do with allocations? What happens if you do the multiplications in-place with mul!?
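For reference, a minimal sketch of what an in-place version of the MWE’s f could look like (f_inplace and the buffer names are hypothetical, not from the original code):

using LinearAlgebra
using Random

function f_inplace(i::Int)
    A = Matrix{Float64}(undef, 30, 30)   # allocate the work buffers once
    B = similar(A)
    for _ in 1:100000
        rand!(A)         # refill A in place instead of allocating a new matrix
        mul!(B, A, A)    # B = A * A without allocating the result
    end
    return
end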


OpenBLAS/README.md at 7a6a24647df49bb0797dfe3d0f43f4dc1389aa41 · OpenMathLib/OpenBLAS · GitHub

But more than anything else, the problem is setting the environment variable inside the Julia process; I’m not sure that’s useful at that point. I presume it’s only read when the library is initialised at startup.

On an AMD CPU MKL is more likely to have the opposite effect.


As @giordano anticipated, setting the environment variable only has an effect if you do it before you start Julia. In a running session, you need to use LinearAlgebra.BLAS.set_num_threads.
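Concretely, a sketch of both options (the script name and worker count below are just placeholders). In the shell, before launching Julia, so that OpenBLAS reads the variable at startup:

OPENBLAS_NUM_THREADS=1 julia -p 8 script.jl

Or from a running session, on the master process and on every worker:

using Distributed
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)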

Emphasising @gdalle’s point: You have many allocations here (200000 per call to f, i.e. 2 per loop iteration: one for rand and one for the product). What scaling do you observe if you allocate A only once and reuse it in each loop iteration (with mul! from LinearAlgebra)?

BTW, which Julia version are you using?

A little benchmark on one of our compute nodes (2x AMD Milan 7763). I pinned the workers to cores in different memory domains.

using Distributed
using BenchmarkTools
using LinearAlgebra
using DelimitedFiles
using Plots
using ThreadPinning

BLAS.set_num_threads(1)

times = Float64[]
nw = Int[]

for n in 1:8
    addprocs(n)

    @everywhere begin
        using ThreadPinning
        w = findfirst(==(myid()), sort!(workers()))
        # !isnothing(w) && pinthread(w - 1)
        !isnothing(w) && pinthread(first.(cpuids_per_numa())[w])

        function f(i::Int)
            for i in 1:100000
                A = rand(30, 30)
                A = A * A
            end
            return
        end

    end

    t = @belapsed pmap(f, 1:8) samples = 3 evals = 1
    push!(times, t)
    push!(nw, nworkers())
    println(nworkers(), " -> ", t)
    rmprocs(workers())
end

writedlm("results.csv", hcat(nw, times, times_opt), ',')
data = readdlm("results.csv", ',')

plot(data[:, 1], data[1, 2] ./ data[:, 2]; marker=:circle, legend=false, frame=:box, xlabel="nworkers", ylabel="speedup")
plot!([1, 10], [1, 10]; ls=:dash)
savefig("speedup.png")

[Plot: speedup vs. nworkers, with workers pinned to cores in different memory domains]

I’m not quite sure I understand the plateau for 3 < nworkers < 8 (or the peak at 8) right now, but apart from that there is perfect scaling across memory domains.

If we pin the workers naively (to the first N cores, i.e. all in the same memory domain, e.g. via the commented-out pinthread(w - 1) line in the script above), we get this:

[Plot: speedup vs. nworkers, with workers pinned to the first N cores]


Thanks! This is really helpful. I would guess the plateau between 4 and 8 workers is because you get the maximum speedup when the number of workers evenly divides the 8 tasks (although by that logic I’m surprised 3 workers does as well as it does). Indeed, I see similar performance gains on my cluster.

Regarding allocations: cutting them down by doing things in-place with mul! does appear to recover more or less perfect scaling. This has been helpful to keep in mind with the actual code I’m trying to optimize; I’ve had success gradually improving the scaling by reducing allocations. This wasn’t obvious to me initially, because the code running on a single worker doesn’t see a very significant improvement from being deliberate about in-place operations, but it does seem to matter for the scaling, which makes sense.
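For anyone finding this later, a quick way to confirm that a rewritten kernel no longer allocates in the hot loop (using the hypothetical in-place variant sketched earlier in the thread):

using BenchmarkTools
@btime f(1)           # original version: on the order of 200000 allocations per call
@btime f_inplace(1)   # in-place variant: only the two up-front buffer allocations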
