Lack of improvement from distributed pmap, understanding a simple example

Dear all,

I am relatively new to Julia and have spent the last few weeks trying to parallelize my code, without much success. Therefore I would like to understand the following minimal example.
My computer has 16 cores, and below is a very simple, embarrassingly parallel task that nevertheless does not scale well even to 8 workers. The question is why. My real problem is much more complicated, involves the creation of many arrays, and runs much longer, but since I can't even make this simple example scale well, I want to understand it first.
Say I have a function called heavy_task that I would like to execute 160 times:

using Distributed
addprocs(8) #compare 1, 2, 4, 8, 16
workers()

@everywhere function heavy_task(i)
    # invert a random 2000×2000 matrix and return a single entry; i is only the task index
    b = inv(rand(2000, 2000))
    return b[1, 1]
end

I aim to do this with pmap and measure the time:

@time res = pmap(i -> heavy_task(i), 1:160)

The timing I get is:

# 1 worker:   71 s
# 2 workers:  42 s
# 4 workers:  28 s
# 8 workers:  23 s
# 16 workers: 25 s

My question is: why? I tried to make the task as simple and as parallel as possible, yet the speedup is minimal. What am I doing wrong?

Thank you so much for your help and many greetings!

Hi, and welcome to the Julia community!

First of all, if you want to run your code on a single computer, I would recommend multithreading over (distributed) multiprocessing, as this should have less overhead. If you aren’t aware, check out (the documentation on) @threads, Tasks, and packages such as OhMyThreads.jl.
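For illustration, here is a minimal multithreaded sketch of the same workload using Threads.@threads (assuming Julia was started with several threads, e.g. julia --threads 8; the BLAS oversubscription caveat discussed below applies here as well):

using Base.Threads, LinearAlgebra

BLAS.set_num_threads(1)  # one BLAS thread per Julia thread to avoid oversubscription

function heavy_task(i)
    b = inv(rand(2000, 2000))
    return b[1, 1]
end

res = Vector{Float64}(undef, 160)
@time @threads for i in 1:160
    res[i] = heavy_task(i)
end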

At least in your simple example, it's important to note that inv is already multithreaded (through BLAS).

julia> using BenchmarkTools

julia> M = rand(2000, 2000);

julia> @btime inv($M);
  190.308 ms (5 allocations: 31.51 MiB)

julia> using LinearAlgebra; BLAS.get_num_threads()
4

julia> BLAS.set_num_threads(1)

julia> @btime inv($M);
  378.011 ms (5 allocations: 31.51 MiB)

With 4 BLAS threads per worker, if you have 16 cores/threads you would then expect an improvement only up to 4 workers; if you have 32 threads, possibly up to 8.
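To avoid this oversubscription, you can pin BLAS to a single thread on every worker, e.g. (a minimal sketch, to be run after addprocs):

using Distributed, LinearAlgebra
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)  # one BLAS thread per worker process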

On my 4C/8T machine, with BLAS.set_num_threads(1) (via @everywhere) I get the following timings:

1 worker:  66 s
2 workers: 39 s
4 workers: 26 s
8 workers: 25 s

while with BLAS.set_num_threads(4) (via @everywhere) this becomes

1 worker:  36 s
2 workers: 29 s
4 workers: 33 s
8 workers: 36 s

Thank you so much for your kind reply, this is very helpful.

If we consider the setup with BLAS.set_num_threads(1) (via @everywhere), what is the reason that we can't scale efficiently up to 4 workers, and why is there almost no gain from going from 4 to 8?

Is it because the problem is memory-bound, or is there something else I am missing? (Communication is deliberately kept to a minimum in this example, but I do need to create arrays in the problem that actually matters to me.)

Hello, in my experience the scaling problem arises from memory-bandwidth utilization. Linear algebra operations can easily use all of the available bandwidth, and I usually see some (maybe linear) scaling only when using different nodes/computers, or with matrix sizes that are "small" ("big" and "small" being problem- and hardware-dependent).

On a single computer, I had to balance the number of processes, the number of BLAS threads, and the matrix sizes. Maybe all you need is a single process that uses all cores to solve one large matrix at a time, rather than trying to solve many matrices in parallel.
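As a rough illustration of that alternative (a minimal sketch, reusing heavy_task from above and assuming a 16-core machine):

using LinearAlgebra

BLAS.set_num_threads(16)             # let BLAS use all cores for each inversion
@time res = map(heavy_task, 1:160)   # plain serial loop; the parallelism lives inside BLAS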

Just one extra comment: you can set the number of Julia threads and enable threaded BLAS when creating the worker processes:

using Distributed
addprocs(4;
    exeflags = `--threads 4`,     # 4 Julia threads per worker
    enable_threaded_blas = true)  # allow multithreaded BLAS on each worker
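
If you want to double-check that the settings took effect, you can query one of the workers directly (a small sketch, assuming a worker with id 2 exists):

@everywhere using LinearAlgebra
fetch(@spawnat 2 Threads.nthreads())       # should report 4 Julia threads
fetch(@spawnat 2 BLAS.get_num_threads())   # should be > 1 with threaded BLAS enabled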

You can measure the memory-bandwidth requirements of a single process with LIKWID.jl or LinuxPerf.jl, and then compare them to the total memory bandwidth of your computer. Also, some multicore CPUs come with "performance cores" and "efficiency cores", but advertise them just as a bunch of cores.
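For example, with LIKWID.jl something along these lines could work (a rough sketch, assuming a Linux machine with the likwid tool installed, LIKWID.jl's @perfmon macro, and hardware support for the "MEM" performance group, whose name is hardware-dependent):

using LIKWID, LinearAlgebra

BLAS.set_num_threads(1)
A = rand(2000, 2000)

# measure memory traffic/bandwidth counters for a single inversion
metrics, events = @perfmon "MEM" inv(A);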

In general, linear algebra done through BLAS (which most languages rely on) will easily saturate most of the resources in the CPU, so things like virtual cores/hyper-threads may not help very much.


Thank you very much, this is very helpful for me!

Thank you so much, this is very helpful. I will try to measure the bandwidth as you suggested.