I am investigating the scaling performance of my distributed linear algebra application. For this purpose I made very basic examples of identical operations done in parallel.
In the following example, I don't understand why the execution time of each operation increases even though the number of flops stays the same. Is there something I missed in the cluster configuration? Is there some multi-threading setting to turn off? (I tried LinearAlgebra.BLAS.set_num_threads(1) without success.)
Having a constant execution time is important for me to time my scaling experiments correctly.
Every operation performs the same number of flops.
The number of operations is equal to the number of cores requested for the job.
Example 1: Matrix multiplication
I execute the following code with different numbers of processes:
using Distributed, Statistics  # workers()/@spawnat from Distributed, mean from Statistics
res = [@spawnat p @elapsed rand(5000,5000) * rand(5000,5000) for p in workers()]
display(mean(map(fetch, res)))
Example 2: Matrix permutation
res = [@spawnat p @elapsed permutedims(rand(10000,10000)) for p in workers()]
display(mean(map(fetch, res)))
Additional information
I’m running this on a node of a SLURM-managed cluster. It is a 32-core node (2x Cascade Lake Intel Xeon 5218, 16 cores each, 2.4 GHz).
The initialisation of the processes is done with ClusterManagers.jl through an sbatch submission as follows:
using ClusterManagers
using Distributed
ncores = parse(Int, ENV["SLURM_NTASKS"])  # one worker per SLURM task
addprocs_slurm(ncores)
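In case it matters, here is roughly how I tried to turn off BLAS multi-threading (a sketch; I am not certain this is the right way to propagate the setting to every worker):
using Distributed, LinearAlgebra
# Apply the BLAS thread setting on every worker, not only on the master process.
@everywhere using LinearAlgebra
@everywhere LinearAlgebra.BLAS.set_num_threads(1)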
If you need any extra information, I would be happy to provide it.
Part of the problem may be that rand accesses shared global state. Independently of other problems, it might be a good idea to first generate the matrices and then benchmark the computations with them (or, alternatively, generate non-random matrices and run independent computations on each worker).
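For instance, something along these lines (an untested sketch; bench_mul and the size 5000 are placeholders) would keep rand out of the timed section entirely:
using Distributed, Statistics
# Deterministic matrices, built inside the worker, so no RNG is involved
# and only the multiplication itself is timed.
@everywhere function bench_mul(n)
    A = ones(n, n)
    B = ones(n, n)
    return @elapsed A * B
end
res = [@spawnat p bench_mul(5000) for p in workers()]
println(mean(map(fetch, res)))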
Memory contention. If you are running on a shared-memory machine, then all the cores are accessing the same main memory (and may also share some caches), which slows them down as you add more processes.
(You shouldn’t see this on problems that are not memory-bound. For example, try a problem where each processor just runs a long, expensive loop but does not access any large array or allocate any memory.)
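For instance, a compute-bound micro-benchmark along these lines (a hypothetical loop, not from your code) should show roughly constant per-worker times as you add processes:
using Distributed, Statistics
# Pure floating-point work on scalars: no large arrays, no allocation,
# so there is no main-memory bandwidth to fight over.
@everywhere function busy_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end
res = [@spawnat p @elapsed busy_loop(10^8) for p in workers()]
println(mean(map(fetch, res)))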
I am not sure either, but I have run into problems with multi-threading and random number generators before, and I have seen this come up repeatedly here. I think the issue is that if you want to generate a single sequence of random numbers, all threads access the same seed sequentially, so the RNG becomes a bottleneck if generating the random numbers is a limiting step. That said, I am not sure when this actually applies and when it does not.
Here is one post I could find referring to this kind of problem:
That’s not how rand works in Julia; it’s generating “independent” pseudorandom streams (i.e. starting from different seeds) on different processes or threads, so no synchronization is involved.
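A quick way to check this (a sketch, assuming workers have already been added) is to look at the streams directly; each process produces its own sequence without touching a shared seed:
using Distributed
# Each worker process has its own default RNG, so the sequences differ.
for p in workers()
    println(p, " => ", fetch(@spawnat p rand(3)))
end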
I have redone the experiments with the random generation beforehand:
For the first example I replaced the code with the following:
@everywhere A = rand(10000,10000)
@everywhere B = rand(10000,10000)
res = [@spawnat p @elapsed A*B for p in workers()]
times = map(fetch, res)
println(mean(times))
but the resulting times are still increasing:
For example 2 the code becomes
@everywhere A = rand(10000,10000)
res = [@spawnat p @elapsed permutedims(A) for p in workers()]
times = map(fetch, res)
println(mean(times))
and the obtained times are constant (0.6 seconds for 1, …, 32 processes). I don’t know whether the scaling issue came from the RNG or from something else, though.
Regarding example 1: from the documentation I think the node has a shared L3 cache for all the cores. I’m surprised that this can affect the performance that much.
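For scale, a quick back-of-envelope check on the data sizes in example 1 (just the matrix footprints, nothing cluster-specific):
n = 5000
bytes_per_matrix = n * n * 8           # Float64 is 8 bytes
println(bytes_per_matrix / 1e6, " MB") # ≈ 200 MB per matrix, far larger than any L3 cache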
I created a new thread here about the possible issues with the random number generator, which may or may not be related to what you are seeing. I am not sure either. Let us see what we learn there.