Huge performance fluctuations in parallel benchmark: insights?

I have to admit that I did some additional reading on pairwise velocity distributions of galaxies, as well as on BenchmarkTools.jl and the BLAS libraries. Neither topic turned out to be trivial.

As for the pairwise velocity distributions of galaxies, I came across the paper by Antonaldo Diaferio and Margaret J. Geller titled GALAXY PAIRWISE VELOCITY DISTRIBUTIONS ON NON-LINEAR SCALES [https://arxiv.org/pdf/astro-ph/9602086.pdf], as well as these YouTube videos:

WHAT LIES BEYOND THE OBSERVABLE UNIVERSE? - YouTube
Measuring the rotation of galaxies - YouTube
Using redshift to measure velocity of galaxies - YouTube
Our Universe Has Trillions of Galaxies, Hubble Study - YouTube
The Most Unusual Galaxies Ever Discovered - YouTube
Galaxies Don't Rotate The Way You Think | 4K - YouTube
The Fastest Star Moves At 8% The Speed Of Light - YouTube
How Fast Are You Moving Through The Universe? - YouTube
A JOURNEY TO INTERGALACTIC SPACE - YouTube

As for BenchmarkTools.jl, I read its documentation [Manual · BenchmarkTools.jl], and for BLAS, several threads on Julia Discourse, from which I will quote one: BLAS thread count vs Julia thread count [BLAS thread count vs Julia thread count - #11 by mbauman].

I would like to ask whether there are any additional suggestions regarding the proper use of Julia threads and BLAS threads, as well as Distributed, and regarding re-running the test I did a couple of days ago. In particular:

- The use of the gcsample, samples and seconds parameters of BenchmarkTools.jl: should I use gcsample=true or not?
- The options Julia should be run with: -t 112, -t 56, -t 28, -t 16, -t 8, -t 1, and JULIA_EXCLUSIVE=1 julia -t 56, -t 28, -t 16, -t 8, -t 1.
- The options BLAS should be run with (I have to admit that I am not quite sure what combinations to try; see the sketch below).
- Should both OpenBLAS and MKL be used for the test?
- Octavian.jl: is it possible to use it without code adjustment, the way I understand MKL can be used?
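Regarding the BLAS combinations, what I had in mind is pinning the BLAS thread count from within Julia, roughly like this (just a sketch; BLAS.set_num_threads and BLAS.get_num_threads are the standard LinearAlgebra calls, and a single BLAS thread per process is one common choice when the Julia threads already carry the parallelism):

```julia
using LinearAlgebra
# using MKL   # swaps the default OpenBLAS for MKL without further code changes (Julia >= 1.7)

@show Threads.nthreads()     # Julia threads, as set by the -t option
BLAS.set_num_threads(1)      # avoid oversubscribing cores with two thread pools
@show BLAS.get_num_threads()
```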

I drafted the following code:

BenchmarkTools, sweeping from N=10_000 up to N=10_000_000 (in the case of gcsample=true, probably only up to 1_000_000):

```julia
using BenchmarkTools
using CellListMap   # provides the florpi benchmark function

btime = @benchmarkable CellListMap.florpi(N=1_000_000, cd=false, parallel=true) gcsample=true samples=1000 seconds=7200
results = run(btime)
```
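To cover the whole range of N, I would wrap this in a loop, something like the following sketch (the sizes are just the endpoints mentioned above):

```julia
# hypothetical sweep over problem sizes; $N splices the current value
# into the benchmark expression (without it, the benchmark would look
# for a global variable named N)
for N in (10_000, 100_000, 1_000_000, 10_000_000)
    b = @benchmarkable CellListMap.florpi(N=$N, cd=false, parallel=true) gcsample=true samples=1000 seconds=7200
    trial = run(b)
    println("N = $N: ", minimum(trial))
end
```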

Also, I should be able to prepare the numerical data in DataFrame format and some plots. The statistics I plan to collect from the results are minimum(results), median(results), mean(results), maximum(results), and std(results).
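For the tabulation, I am imagining something along these lines (a sketch assuming DataFrames.jl; the times are read from the fields of the TrialEstimate objects returned by the statistics functions, in nanoseconds):

```julia
using DataFrames, Statistics

# one row of summary statistics (in ns) for the trial collected above
df = DataFrame(
    N       = 1_000_000,
    minimum = minimum(results).time,
    median  = median(results).time,
    mean    = mean(results).time,
    maximum = maximum(results).time,
    std     = std(results).time,
)
```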

I find this particularly interesting because similar fluctuations appear not only with CellListMap.jl but also with AlphaZero.jl, where I saw huge differences when running on CPU-only machines. What is interesting about AlphaZero.jl is that on a machine with 56 cores, using Distributed and MKL, the training time of the first phase can be reduced to "only" about 1 hour; however, the next 3 phases take 8 hours, and it seems that only 1 thread is utilized in full.

Although coding is not my area of expertise, I am interested in these topics and would like to understand them better. May I ask whether there is any opinion on this topic, or perhaps a comprehensive / exhaustive blog post, please?
