I am investigating the scaling performance of my distributed linear algebra application. For this purpose, I put together very basic examples of identical operations performed in parallel.
In the following examples, I don't understand why the execution time of each operation increases with the number of processes even though the number of flops is the same. Is there something I missed in the cluster configuration? Is there some multi-threading setting to turn off? (I tried `LinearAlgebra.BLAS.set_num_threads(1)` without success.)
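For reference, this is roughly how I would expect the setting to be applied, on every worker rather than only on the master process (a minimal sketch; please correct me if this is wrong):

```julia
using Distributed

# The BLAS thread count is a per-process setting, so apply it
# on every worker, not only on the master process.
@everywhere using LinearAlgebra
@everywhere LinearAlgebra.BLAS.set_num_threads(1)
```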
Having a constant execution time is important for me to time my scaling experiments correctly.
- Every operation involves the same number of flops.
- There are as many operations as cores requested for the job.
Example 1: Matrix multiplication
I execute the following code with varying numbers of processes:
```julia
using Statistics  # for mean

res = [@spawnat p @elapsed rand(5000, 5000) * rand(5000, 5000) for p in workers()]
display(mean(map(fetch, res)))
```
Example 2: Matrix permutation
```julia
res = [@spawnat p @elapsed permutedims(rand(10000, 10000)) for p in workers()]
display(mean(map(fetch, res)))
```
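In case it helps with diagnosis, here is a minimal sketch of how I could look at the per-worker spread of timings instead of only the mean (same setup as above):

```julia
using Distributed, Statistics

# Collect per-worker timings and report the spread;
# a large max/min gap would point at contention between processes.
res = [@spawnat p @elapsed rand(5000, 5000) * rand(5000, 5000) for p in workers()]
times = map(fetch, res)
println("min = ", minimum(times), ", mean = ", mean(times), ", max = ", maximum(times))
```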
I'm running this on a node of a SLURM-managed cluster. It is a 32-core node (2× Cascade Lake Intel Xeon 5218, 16 cores each, 2.4 GHz).
The initialisation of the processes is done with ClusterManagers.jl through an sbatch submission as follows:

```julia
using ClusterManagers
using Distributed

ncores = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(ncores)
```
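If it's relevant, here is a small sanity check I would use to confirm where each worker process ends up (just a sketch; the printout format is arbitrary):

```julia
using Distributed, Sockets

# Print the host and PID of each worker to check their placement.
@everywhere using Sockets
for p in workers()
    host, pid = fetch(@spawnat p (gethostname(), getpid()))
    println("worker $p -> host = $host, pid = $pid")
end
```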
If you need any extra information, I would be happy to provide it.
Thank you for your help.