An update to this topic, which may be useful for someone reaching this thread.
Hardware
First, the hardware where I benchmark the code has this processor:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
with 16 cores (32 threads at most).
All the examples are run with JULIA_NUM_THREADS=8, which is well below the number of physical cores and hardware threads available.
This is important because I do not see the same level of fluctuation on my personal computer (a regular Intel laptop with 4 cores / 8 threads), so the behavior is somehow hardware-specific.
Julia version: 1.6.3
I don't know if anything changed in 1.7.0 that might affect this. In any case, I can reproduce what I had in 1.6.3. Some initial tests appear to indicate that 1.7.0 does behave differently, but I won't mix that in here for now.
1. Baseline
The typical output I had was something like this (the histogram, made with UnicodePlots, results from running the same calculation 30 times):
 [ 4.0, 6.0)  ┤▇▇ 1
 [ 6.0, 8.0)  ┤ 0
 [ 8.0, 10.0) ┤▇▇ 1
 [10.0, 12.0) ┤▇▇ 1
 [12.0, 14.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 20
 [14.0, 16.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 7
              └                                      ┘
                              Frequency
As one can see, sometimes (here once) I get a 5-second run in the exact same problem in which most frequently I get a 13-second run. The 5-second time is the expected one, considering the progression of times from smaller to larger systems and the dependence of the time on the size of the problem (number of particles in this case).
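For reference, the timing loop behind these histograms can be sketched like this; here run_calculation is a hypothetical stand-in for the actual computation being benchmarked, and the histogram call is plain UnicodePlots:

```julia
using UnicodePlots

# Hypothetical stand-in for the actual calculation being benchmarked
run_calculation() = sum(sqrt(i) for i in 1:10^6)

# Run 30 times and collect the wall-clock time of each run
times = Float64[]
for _ in 1:30
    push!(times, @elapsed run_calculation())
end

# Histogram of the run-time distribution, as shown above
println(histogram(times, nbins=6, xlabel="Frequency"))
```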
Important: in this and the following case, the multi-threaded loop is structured as:
@threads for it in 1:nthreads()
    for task in splitter(...) # the workload doesn't differ much
        # run task
    end
end
Thus, the number of batches is equal to the number of threads available (in this case, 8).
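As a self-contained sketch of this pattern: the splitter below is a hypothetical contiguous-range partitioner (the actual splitter in the code may differ), and the "task" is just a sum so the result can be checked:

```julia
using Base.Threads

# Hypothetical splitter: partitions 1:n into nchunks contiguous ranges
# of nearly equal length (the real splitter may differ)
function splitter(ichunk, nchunks, n)
    base, r = divrem(n, nchunks)
    len = base + (ichunk <= r ? 1 : 0)
    first = 1 + (ichunk - 1) * base + min(ichunk - 1, r)
    return first:(first + len - 1)
end

n = 1000
partial = zeros(nthreads())  # one accumulator per thread: no races
@threads for it in 1:nthreads()
    for i in splitter(it, nthreads(), n)  # one batch per thread
        partial[it] += i                  # "run task"
    end
end
@assert sum(partial) == n * (n + 1) ÷ 2
```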
2. JULIA_EXCLUSIVE=1
Start Julia with:
JULIA_EXCLUSIVE=1 julia-1.6.3/bin/julia -t 8
This has a great impact on the fluctuations and on the performance:
 [5.0, 5.5) ┤▇▇▇ 1
 [5.5, 6.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
 [6.0, 6.5) ┤ 0
 [6.5, 7.0) ┤▇▇▇▇▇▇▇▇ 3
 [7.0, 7.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12
            └                                      ┘
                            Frequency
Now we get much closer to the expected time in most runs. Still, there are fluctuations, and the time is often not ideal.
3. Using @spawn
Now I changed the loop structure to:
@sync for it in 1:nbatches
    @spawn for task in splitter()
        # run task
    end
end
The point here is that now I have the flexibility to increase the number of batches and use any free threads. Using nbatches=8 (thus nothing new for the moment), and NOT using JULIA_EXCLUSIVE=1, there is already an improvement relative to (1), though the fluctuations are huge:
 [ 4.0, 6.0)  ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6
 [ 6.0, 8.0)  ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7
 [ 8.0, 10.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10
 [10.0, 12.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5
 [12.0, 14.0) ┤▇▇▇▇ 1
 [14.0, 16.0) ┤▇▇▇▇ 1
              └                                      ┘
                              Frequency
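The same sketch, now with the @sync/@spawn structure, decouples the number of batches from the number of threads; the scheduler assigns each spawned task to whatever thread is free (splitter is the same hypothetical range partitioner as before):

```julia
using Base.Threads

# Hypothetical contiguous-range partitioner (the real splitter may differ)
function splitter(ichunk, nchunks, n)
    base, r = divrem(n, nchunks)
    len = base + (ichunk <= r ? 1 : 0)
    first = 1 + (ichunk - 1) * base + min(ichunk - 1, r)
    return first:(first + len - 1)
end

n = 1000
nbatches = 32              # can now exceed nthreads()
partial = zeros(nbatches)  # one accumulator per batch: no races
@sync for it in 1:nbatches
    @spawn for i in splitter(it, nbatches, n)
        partial[it] += i   # "run task"
    end
end
@assert sum(partial) == n * (n + 1) ÷ 2
```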
4. Increasing nbatches
Increasing the number of batches (here to 32), without JULIA_EXCLUSIVE=1, provides a huge performance benefit in the average run:
 [4.0, 4.5) ┤▇▇▇▇▇▇▇▇ 4
 [4.5, 5.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18
 [5.0, 5.5) ┤▇▇▇▇ 2
 [5.5, 6.0) ┤ 0
 [6.0, 6.5) ┤▇▇▇▇▇▇▇▇ 4
 [6.5, 7.0) ┤▇▇▇▇ 2
            └                                      ┘
                            Frequency
The worst runs (those taking 14-16 seconds) disappear from the list.
5. Back with JULIA_EXCLUSIVE=1, with nbatches=8
With exclusive mode, but with a small number of batches, there is an improvement relative to (3), which shows that it is not the workload imbalance between threads that causes the longer times (14 s), but the uneven times obtained from the different threads.
 [4.0, 4.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13
 [4.5, 5.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9
 [5.0, 5.5) ┤ 0
 [5.5, 6.0) ┤ 0
 [6.0, 6.5) ┤▇▇▇▇▇▇ 2
 [6.5, 7.0) ┤▇▇▇▇▇▇▇▇ 3
 [7.0, 7.5) ┤▇▇▇▇▇▇▇▇ 3
            └                                      ┘
                            Frequency
6. JULIA_EXCLUSIVE=1 and nbatches=32
            ┌                                      ┐
 [4.0, 4.2) ┤▇▇ 1
 [4.2, 4.4) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 21
 [4.4, 4.6) ┤▇▇ 1
 [4.6, 4.8) ┤▇▇▇ 2
 [4.8, 5.0) ┤▇▇▇▇▇ 3
 [5.0, 5.2) ┤▇▇▇ 2
            └                                      ┘
                            Frequency
With @spawn, 32 batches, and JULIA_EXCLUSIVE=1, all runs fall within the "correct" time.
Conclusions
The original problem is associated with the uneven run time of each thread, but not because the workload differs much between tasks (the workload is, in fact, very homogeneous). For some unknown reason, some tasks sometimes take much longer to run, causing the overall slowdown in the original implementation. JULIA_EXCLUSIVE=1 greatly reduces that random uneven performance of the threads, so the problem is minimized. However, optimal performance is obtained only when that is combined with spawning more tasks than threads, so that tasks can run asynchronously on any available thread; that is achieved by reducing the size of the batches.
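The per-task unevenness can be observed directly by timing each spawned task; in this sketch (names are illustrative and the workload is a uniform dummy, not the actual calculation), a large max/min ratio between task times reveals the slow outliers:

```julia
using Base.Threads

# Spawn nbatches identical tasks and record each task's own elapsed time
function task_times(nbatches)
    times = zeros(nbatches)
    sums = zeros(nbatches)
    @sync for it in 1:nbatches
        @spawn begin
            t0 = time()
            s = 0.0
            for i in 1:10^6   # homogeneous workload in every batch
                s += sqrt(i)
            end
            sums[it] = s      # keep the result so the work isn't optimized away
            times[it] = time() - t0
        end
    end
    return times
end

t = task_times(8)
println("max/min task time ratio: ", maximum(t) / minimum(t))
```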
I do not observe all these variations in every hardware I have access to.
Concerning CellListMap, the package in question: version 0.7, a breaking release to appear soon, incorporates the multi-threading changes and the possibility of tuning the number of batches for specific calculations, to improve overall parallel performance.
I really appreciate all the help received here, the tips and suggestions, which allowed me to improve the implementation and understand the problem.
Fastest possible
For this test case, a 3M-particle calculation, the fastest performance on this machine is obtained with julia -t32 and nbatches=2^13=8192, for which we have (exclusive mode does not help here):
            ┌                                      ┐
 [1.8, 2.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
 [2.0, 2.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13
 [2.2, 2.4) ┤▇▇▇▇▇ 2
 [2.4, 2.6) ┤▇▇▇ 1
            └                                      ┘
                            Frequency