Why does increasing the number of identical parallel operations increase the time of each operation?

Hi,

I am investigating the scaling performance of my distributed linear algebra application. For this purpose I made very basic examples of identical operations run in parallel.

In the following examples, I don’t understand why the execution time of each operation increases even though the number of flops is the same. Is there something I missed in the cluster configuration? Is there some multi-threading setting to turn off? (I tried LinearAlgebra.BLAS.set_num_threads(1) without success.)
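
For reference, this is roughly how I applied it (a minimal sketch; I am not sure it is the right way to do it on every process):

using Distributed, LinearAlgebra

# Disable BLAS threading on the master and on every worker.
@everywhere using LinearAlgebra
@everywhere LinearAlgebra.BLAS.set_num_threads(1)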

Having a constant per-operation execution time is important for me to time my scaling experiments correctly.

  • Every operation has the same number of flops.
  • There are as many operations as there are cores requested for the job.

Example 1: Matrix multiplication

I execute the following code with different numbers of processes:

using Distributed, Statistics
# Each worker times generating and multiplying two 5000x5000 matrices; the times are averaged.
res = [@spawnat p @elapsed rand(5000,5000) * rand(5000,5000) for p in workers()]
display(mean(map(fetch, res)))

[plot test_matrix_mult: elapsed time per operation vs. number of processes]

Example 2: Matrix permutation

# Each worker times generating and permuting a 10000x10000 matrix.
res = [@spawnat p @elapsed permutedims(rand(10000,10000)) for p in workers()]
display(mean(map(fetch, res)))

[plot test_permutation: elapsed time per operation vs. number of processes]

Additional information

I’m running this on a node of a SLURM-managed cluster. It is a 32-core node (2x Cascade Lake Intel Xeon 5218, 16 cores each, 2.4 GHz).
The processes are initialised with ClusterManagers.jl through an sbatch submission as follows:

using ClusterManagers
using Distributed

# Spawn one worker per SLURM task in the allocation.
ncores = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(ncores)

If you need any extra information, I would be happy to provide it.

Thank you for your help.

Part of the problem may be that rand accesses variables that are shared. Independently of other problems, it might be a good idea to first generate the matrices and then benchmark the computations with them (or, alternatively, generate non-random matrices with independent computations on each thread).
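
For example, something along these lines (an untested sketch; timed_mult is just an illustrative name) keeps the random number generation out of the timed part entirely:

using Distributed, Statistics

# Build deterministic matrices on each worker, then time only the multiplication.
@everywhere function timed_mult(n)
    A = fill(1.0, n, n)   # non-random inputs, no RNG involved
    B = fill(2.0, n, n)
    return @elapsed A * B
end

res = [@spawnat p timed_mult(5000) for p in workers()]
println(mean(map(fetch, res)))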

Memory contention. If you are running on a shared-memory machine, then all the cores are accessing the same main memory (and may also share some caches), which slows them down as you add more processes.

(You shouldn’t see this on problems that are not memory-bound. For example, try a problem where each processor is just running a long expensive loop but is not accessing any large array or allocating any memory.)
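
For instance, something like this (just a sketch; busy_loop is a made-up name) should show roughly constant per-process times as you add workers:

using Distributed, Statistics

# Purely compute-bound kernel: no large arrays, no allocations in the loop.
@everywhere function busy_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)   # arbitrary floating-point work
    end
    return s
end

res = [@spawnat p @elapsed busy_loop(10^8) for p in workers()]
println(mean(map(fetch, res)))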

I’m not sure what you’re referring to?

I am not sure either, but I have run into problems with multi-threading and random number generators before, and I have seen this come up repeatedly here. I think the problem can arise when one wants to generate a single sequence of random numbers: all threads then access the same RNG state sequentially, which becomes a bottleneck if the generation of the random numbers is a limiting step. Anyway, I am not sure when this actually applies and when it does not.

Here is one post which I could find referring to this kind of problem:

That’s not how rand works in Julia; it’s generating “independent” pseudorandom streams (i.e. starting from different seeds) on different processes or threads, so no synchronization is involved.
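
You can check this with a quick sketch like the following (each worker reports values from its own stream):

using Distributed

# Each worker process has its own default RNG; no cross-process locking is needed.
for p in workers()
    println(p, " => ", remotecall_fetch(() -> rand(3), p))
end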

I have redone the experiments with the random matrix generation done beforehand:

For the first example, I replaced the code with the following:

using Distributed, Statistics

@everywhere A = rand(10000,10000)
@everywhere B = rand(10000,10000)

# Time only the multiplication; the matrices were generated beforehand.
res = [@spawnat p @elapsed A*B for p in workers()]
times = map(fetch, res)
println(mean(times))

but the resulting times are still increasing:

[plot test_matrix_mult: elapsed time per operation vs. number of processes]

For example 2, the code becomes:

@everywhere A = rand(10000,10000)

# Time only the permutation; the matrix was generated beforehand.
res = [@spawnat p @elapsed permutedims(A) for p in workers()]
times = map(fetch, res)
println(mean(times))

and the obtained times are constant (0.6 seconds for 1, …, 32 processes). I don’t know whether the scaling issue came from the RNG or from something else, though.

Regarding example 1: from the documentation I think the node has an L3 cache shared by all the cores. I’m surprised that this can affect the performance that much.
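
Maybe I should also time a kernel that mostly streams memory rather than computing, to see whether it degrades in the same way as processes are added. A rough sketch (stream_sum is just a made-up name):

using Distributed, Statistics

# Memory-bandwidth-bound kernel: one read per element, very little arithmetic.
@everywhere function stream_sum(n)
    A = rand(n)            # allocated outside the timed part
    return @elapsed sum(A)
end

res = [@spawnat p stream_sum(10^8) for p in workers()]
println(mean(map(fetch, res)))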

I created a new thread here about the possible issues with the random number generator, which may or may not be related to what you are seeing. I am not sure either. Let us see what we learn there 🙂
