Again on reaching optimal parallel scaling

This is somewhat vague, and quite certainly not Julia-specific, so any bibliography, reference, or experience is appreciated.

I am benchmarking an implementation of cell lists to compute particle interactions and other properties. In the example below I’m computing the Lennard-Jones potential for a random set of particles, ranging from 10^4 to 8\times 10^7 particles.

Of course, scaling cannot be perfect for every system. An ideal scaling would be one for which one can achieve a linear speedup while linearly increasing, at the same time, the system size. That is, if one has a linear speedup for N particles with 8 processors, one should have a linear speedup for 2N particles with 16 processors.

In my problem, I am getting fairly linear speedups independently of problem size up to 8 processors, and then the scaling starts to depend on the size. For about 8\times 10^6 particles (the yellow curve), scaling is still linear up to 32 processors. What one would like, or expect, is that the red curves would follow the dashed line (ideal scaling) progressively more closely, implying that good scaling is obtained with more processors for larger systems.

image

Of course this is not happening, and for some reason large systems start to be penalized even with 16 processors (red curves deviate from the dashed line). The total memory of the node is far greater than the memory requirements of the problem, so that is not an issue. Also, the computation is completely non-allocating.

I imagine that concurrent memory accesses are ultimately limiting the scalability (there are no locks or atomic operations either, by the way). But I don’t really know how to profile that.

Any hint on how to explore this problem further is appreciated. I have read everything @tkf wrote already (sorry for the pinging) and I think I’m following the advice he gives in his notes, at least the ones I understand.

Anyway, any experience or speculation is welcome, to give me a leading route to explore this further.

One thing that currently doesn’t scale well in Julia is GC. Julia uses a stop-the-world approach and a mostly single-threaded GC, so if you are at 1% GC time on 1 thread, by 32 threads that ends up taking almost 1/4 of your time.
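A quick way to check this (a rough sketch, with a stand-in allocating workload rather than the original benchmark): @timed reports the GC time of the wrapped call, so running the same script with julia -t 1 and julia -t 32 shows how the GC fraction grows.

work() = sum(sum(rand(100)) for _ in 1:10^5)   # stand-in allocating workload
r = @timed work()
println("GC fraction: ", r.gctime / r.time)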

5 Likes

Yes, I agree with Oscar_Smith that it’d be useful to look at %GC time.

Also, it’d be nice if you can share the benchmark code.

Other random shots:

If the inefficiency is coming from the scheduler, I sometimes see that JULIA_EXCLUSIVE=1 helps a bit.

If your algorithm has some "quiescent moments" where there are no parallel tasks for a short amount of time (e.g., you have outer serial loops), increasing the sleep threshold sometimes helps too. In a recent-ish 1.8-DEV you can try, say, JULIA_THREAD_SLEEP_THRESHOLD=1000000000 (spin for 1 second before sleeping).

To investigate this further, maybe you’d need to use Profile.jl. There are some bug fixes in 1.7 for profiling multi-threaded code, so it’d be useful to upgrade Julia (though maybe you have upgraded it already). It’s sometimes helpful to include the C functions in the profile. If some C functions (e.g., multiq_deletemin) from the scheduler are at the top of the list, you know that it’s the scheduler’s fault. Maybe you can increase the base case size if that’s the case.
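A minimal sketch of that (the profiled function is a stand-in, not CellListMap’s API):

using Profile

work() = sum(sum(rand(1000)) for _ in 1:10^4)   # stand-in for the hot threaded call
@profile work()
Profile.print(C = true, mincount = 50)          # include C frames, hide rare ones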

Another way to profile the program is to look at the performance counters using https://github.com/JuliaPerf/LinuxPerf.jl (or just the perf stat command) if you use Linux. A possibly relevant counter is the last-level cache (LLC) misses. There’s https://github.com/tkf/BenchPerf.jl/tree/master/examples/sum_gather (and also ../parallel_sum_gather) [1], which is an example of playing with the performance counters (based on CppCon 2017: Chandler Carruth "Going Nowhere Faster" - YouTube), although it’s not documented at all (and still WIP). If you have more LLC misses with more threads, you might need to use some techniques to reuse caches, if applicable to your program. I don’t know exactly what CellListMap.jl does, but I wonder if it’s helpful to get some inspiration from the cache-oblivious algorithm for n-body simulation (e.g., CppCon 2014: Pablo Halpern "Decomposing a Problem for Parallel Execution" - YouTube).
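A minimal sketch, assuming LinuxPerf.jl’s @pstats macro works as in its README (the workload is a stand-in); comparing the cache-misses count at 1, 8, and 32 threads shows whether misses grow faster than the thread count:

using LinuxPerf

work() = sum(sum(rand(1000)) for _ in 1:10^4)   # stand-in workload
@pstats "cpu-cycles,cache-references,cache-misses" begin
    work()
end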

PS: Please feel free to ping me :slight_smile: I’m always curious how other people do parallel programming in Julia!


  1. Checking out the repository and running cd examples; make Manifest.toml; cd sum_gather; make benchmark; make report.ipynb should work, in principle. ↩︎

6 Likes

Indeed, that seems to be an important part of the problem!

The calculation has two parts: constructing the cell lists, and computing the pairwise interactions given the cell lists (similar to what neighbor-list algorithms do with ball trees).

Building the cell lists is of course allocating, but the mapping can be allocation free (depending on the function to be mapped, but in this case it is).

The second part is most of the time by far the most expensive one. For instance, this is what I get with one thread:

  • Time for building the cell lists:
julia> t1
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.738 s …    4.139 s  ┊ GC (min … max): 3.11% … 5.38%
 Time  (median):     3.938 s               ┊ GC (median):    4.30%
 Time  (mean ± σ):   3.938 s ± 284.052 ms  ┊ GC (mean ± σ):  4.30% ± 1.60%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.74 s         Histogram: frequency by time         4.14 s <

 Memory estimate: 1.59 GiB, allocs estimate: 100413.
  • Time for computing the function:
julia> t2
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 54.880 s (0.00% GC) to evaluate,
 with a memory estimate of 3.80 KiB, over 34 allocations.

GC time is thus about 5% of the time required for building the cell lists, but I was not worrying too much about that because the second part is much slower anyway.

The construction of the cell lists is not very easy to parallelize, because the time required for reduction is comparable to the time of the computations, even if using a tree-based asynchronous reduction (sorry if the terminology is nonsense, but I guess you understand what I mean).

I don’t have quick access to the 128-core machine (it takes 3 days for anything to start running), but on a 16-core machine where I can test things quickly I get:

  • Building the cell lists:
julia> t1
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.196 s …    2.499 s  ┊ GC (min … max): 56.57% … 51.40%
 Time  (median):     2.219 s               ┊ GC (median):    55.98%
 Time  (mean ± σ):   2.305 s ± 168.488 ms  ┊ GC (mean ± σ):  52.53% ±  3.54%

  █   █                                                    █
  █▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.2 s          Histogram: frequency by time          2.5 s <

 Memory estimate: 3.81 GiB, allocs estimate: 1722817.
  • Mapping the function:
julia> t2
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.439 s …    3.618 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.528 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.528 s ± 126.171 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.44 s         Histogram: frequency by time         3.62 s <

 Memory estimate: 55.86 KiB, allocs estimate: 400.

The scaling of the second part is almost perfect: 54.88/3.44 = 15.95 (for 16 threads).

The first part has two issues: first, with perfect scaling it should take 0.23s, but even if we discount 52% GC time, that would be about 1s. I don’t know if I can improve that much further.

But with 52% GC time that went to 2s! And that with 16 threads. I guess that GC time with 64-128 threads is exploding... I will check that, and that might explain at least part of the slowdowns.

If I run with 8 threads, I get 1.343 s and 31% GC time for the first part. Thus, if the GC time keeps increasing at that pace (roughly proportionally to the number of threads), one would expect that GC completely dominates the time for the construction of the cell lists and, since the other part is fast, the complete calculation.
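(Rough model, just to make the expectation explicit: treating the GC work as an essentially serial cost t_{GC} while the rest of the construction scales as t_{par}/p with p threads, the GC fraction behaves as t_{GC} / (t_{GC} + t_{par}/p) \to 1 as p \to \infty, which is roughly consistent with the jump from 31% at 8 threads to 52% at 16.)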

So, preliminary conclusion: focus on the effect of GC on the parallelization of the construction of the cell lists.

Thanks very much @tkf also for the references, they will certainly be helpful. I think the toughest challenge is to come up with an efficient parallel way to construct the cell lists, because the computation is cheap and necessarily requires allocating stuff; but it seems that, to scale things further, that has to be sorted out, because the second part appears to be running fast enough in comparison.

By the way, the benchmark code I’m running is this one:

Code
using CellListMap
using FastPow
using BenchmarkTools

ulj(d2,u) = @fastpow u += 4*(1/d2^6 - 1/d2^3)

function scaling(N=10^6;parallel=true)
    nthreads = Threads.nthreads()
    GC.gc()
    # setup points (irrelevant)
    t0 = @benchmark CellListMap.xatomic($N)
    x, box = CellListMap.xatomic(N)
    # Create cell lists (minor... or not)
    t1 = @benchmark CellList($x,$box,parallel=$parallel)
    umap(x,y,i,j,d2,u) = ulj(d2,u)
    cl = CellList(x,box,parallel=parallel)
    # Pairwise computation (expensive)
    t2 = @benchmark map_pairwise($umap, 0., $box, $cl, parallel=$parallel)
    return t0, t1, t2
end

# the times reported are obtained by running:
t0, t1, t2 = scaling(8*10^6)
# where t1 is the benchmark of the construction of the cell lists and t2 is the time
# required for computing the potential

*I’m using JULIA_EXCLUSIVE=1 in these tests.

1 Like

FWIW, there is also JuliaPerf/LIKWID.jl (a Julia wrapper for the performance monitoring and benchmarking suite LIKWID), which has at least some documentation. :slight_smile:

2 Likes

Isn’t this the BLAS issue that I pointed out in a similar thread a few months ago? In addition, I recall that the JULIA_EXCLUSIVE=1 trick was inferior to HyperThreading as the computational intensity of the problem increased, at least on some of the latest x86 CPUs.

I don’t think so, I’m not sure now if I’m using BLAS at all there (there are a few linear algebra operations with small static matrices only). But I will check that again.
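A quick way to rule that out (just a sketch):

using LinearAlgebra
@show BLAS.get_num_threads()   # how many threads BLAS is currently using
BLAS.set_num_threads(1)        # pin BLAS to one thread and rerun the benchmark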

Overall, yes, I get better performance using multithreading, but the benchmarks become more unstable, so I decided to benchmark using it for the sake of consistency.

I recall that when I ran a profile on florpi there were some BLAS operations; however, I do not know how intensive they were. When I see the number 8 on Julia versions lower than 1.8, there is always a BLAS light blinking in my head. In general, based mostly on my intuition, I would not expect a full linear speedup up to the max number of cores / threads. This was also confirmed with BandwidthBenchmark.jl; take a look at the examples below, done on 2 x Xeon Gold 6128 CPUs. Anyway, I recall that I started to prepare a short script to collect @benchmarkable CellListMap statistics, put them into a dataframe, and easily plot comparisons of different setups. I put it away for various reasons. I just took a look at it. I am a little reluctant to send it to you, as the coding is really not at a high level and it is unfinished, but I think the general idea might not be that bad. Maybe running your benchmarks with physical cores, logical cores, different BLAS setups, and Julia or OS affinitization would reveal some additional information. However, in general, I would follow the advice of @tkf and @carstenbauer and prepare the suite of suggested tests. One can hardly get better advice on those topics than here; however, I have to admit that the suggested scope, especially the one by @tkf, is a little bit overwhelming. :slight_smile:

24 OCPUs out of MAX 24 OCPUs

                Bandwidth Scaling - pinthreads(:compact)  
               +----------------------------------------+ 
        100000 |                          .             | 
               |                          ]             | 
               |                          |,            | 
               |                         .`.            | 
               |             .      .___.| | .          | 
               |       ..   .".. ..*`   \| |.`\.        | 
               |      .` """   '"`       \ "`  \        | 
   MB/s        |    .*                         "        | 
               |    ,                                   | 
               |   /                                    | 
               |  .`                                    | 
               |  /                                     | 
               |  |                                     | 
               | |                                      | 
         20000 | |                                      | 
               +----------------------------------------+ 
                0                                     30  
                                 # cores

12 out of MAX 24 OCPUs

               Bandwidth Scaling - pinthreads(:compact)  
              +----------------------------------------+ 
        90000 |                                        | 
              |                       ,                | 
              |                       ,                | 
              |                      .                 | 
              |           ._  _r-*---,                 | 
              |         r"` ""                         | 
              |       ./                               | 
   MB/s       |      r`                                | 
              |     /                                  | 
              |    ,`                                  | 
              |   .,                                   | 
              |   /                                    | 
              |  .`                                    | 
              |  /                                     | 
        20000 |  |                                     | 
              +----------------------------------------+ 
               0                                     20  
                                # cores
1 Like

If you can pre-compute the lengths of all the vectors you are using, or can provide a reasonable guess, I think it’s better to allocate the vectors before spawning the tasks, even if it introduces some initial serial computation or wastes some memory.

By the way, I don’t think the strategy used for processing and merging the cell lists is very idiomatic task-parallel code (even though it is useful in other situations like GPU kernel programming). I’d recommend using the recursive divide-and-conquer strategy for this (or letting Folds.mapreduce or @floop do it) so that you can minimize the number of tasks and synchronizations. There’s no good online tutorial explaining this that I’m aware of, but maybe you can get the idea from this implementation of countmap. That said, I don’t think it is the bottleneck here yet.
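For reference, a minimal divide-and-conquer mapreduce sketch of the kind meant here (not CellListMap’s actual code; basesize is the serial base case size):

using Base.Threads: @spawn

function dac_mapreduce(f, op, xs; basesize = 10_000)
    if length(xs) <= basesize
        return mapreduce(f, op, xs)                  # serial base case
    end
    mid   = div(length(xs), 2)
    left  = @spawn dac_mapreduce(f, op, view(xs, 1:mid); basesize = basesize)
    right = dac_mapreduce(f, op, view(xs, mid+1:length(xs)); basesize = basesize)
    return op(fetch(left), right)                    # merge the two halves
end

dac_mapreduce(x -> x^2, +, collect(1:10^6))

Each level of the recursion merges only two partial results, so the number of tasks and synchronizations grows with the number of base cases rather than with a fixed batch count.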

I also wonder if it makes sense to use struct-of-arrays pattern (e.g., StructVector{<:Cell} instead of Vector{<:Cell}) for better access pattern. But again I don’t think it matters until reducing the %GC time.
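A sketch of that idea (Cell here is a stand-in, not CellListMap’s actual type):

using StructArrays

struct Cell                       # stand-in cell type
    icell::Int
    center::NTuple{3,Float64}
end

cells_aos = [Cell(i, (rand(), rand(), rand())) for i in 1:10]  # array of structs
cells_soa = StructArray(cells_aos)   # struct of arrays (a StructVector here)
cells_soa.icell                      # one contiguous Vector{Int} of that field

Loops that touch only one field then stream through contiguous memory instead of striding over whole structs.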

FYI, tkf/BenchmarkConfigSweeps.jl on GitHub is a small set of utilities for helping with this type of workflow.

1 Like

edit: the benchmarks I discussed here were wrong, because of incorrect use of variable interpolation with @benchmark.

wrong stuff

Again thanks for the tips.

I cannot really preallocate things exactly in advance, because the particles are shadowed into the boundaries to use ghost cells, and I cannot know in advance how many ghost particles there will be.

But the good thing is that I already provide means to reuse a previously allocated cell list, such that allocations are zero if the coordinates do not change (and minimal if they change, just to adapt to possible variations). Edit: I remember now, I have tried preallocating strategies, but the computation is so cheap here that doing anything extra introduces an overhead. The cost of GC is new information here.

Most interestingly, I am figuring out now that I can use that feature not to preallocate, but to keep the arrays "alive", and thus not garbage collected, using:

x0, box0 = CellListMap.xatomic(5000) # small system, very fast
cl = CellList(x0,box0) # build cell lists for the small system
aux = CellListMap.AuxThreaded(cl) # preallocate auxiliary arrays for cell lists
x, box = CellListMap.xatomic(10^7) # much larger system
cl = UpdateCellList!(x,box,cl,aux) # build cell lists for the large system

Although the small and large systems are very different, and there will be a lot of allocations in the cell list update, the fact that they use the previous structure, coming from an outer scope, prevents them from being garbage collected.

So, if we had before:

julia> @benchmark CellList($x,$box)
BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  116.259 ms … 305.339 ms  ┊ GC (min … max): 0.00% … 43.32%
 Time  (median):     151.771 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   167.708 ms ±  53.939 ms  ┊ GC (mean ± σ):  9.80% ± 15.52%

  ▂ █
  █▅█▁▅▁█▁█▅▁▁▁███▅▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▁▁▁▁▅▁▁▁▁▁▁▁▅▅▁▁▁▁▁▁▁▁▁▅ ▁
  116 ms           Histogram: frequency by time          305 ms <

 Memory estimate: 404.42 MiB, allocs estimate: 121185.

now we have:

julia> x_min, box_min = CellListMap.xatomic(5000);

julia> cl0 = CellList(x_min,box_min);

julia> aux0 = CellListMap.AuxThreaded(cl0);

julia> x, box = CellListMap.xatomic(10^6);

julia> @benchmark UpdateCellList!($x,$box,cl,aux) setup=(cl=deepcopy(cl0),aux=deepcopy(aux0)) evals=1
BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):  100.982 ms … 111.468 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     104.191 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   104.652 ms ±   2.111 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁  ▁▁ ▁▁ ▄█     █▄ ▁
  ▆▁▁▁▁▆▆▁█▆▆██▆██▆██▆▆▆▁▆██▁█▆▆▆▁▁▁▁▆▁▁▁▁▁▁▁▁▆▁▁▁▆▁▁▁▁▆▁▁▁▁▁▁▆ ▁
  101 ms           Histogram: frequency by time          111 ms <

 Memory estimate: 13.05 KiB, allocs estimate: 156.

I don’t really understand how allocations are being counted here, because the results of both processes are the same and what is reported as allocations is very different*. But now I can see how these things go without the garbage collection. I have sent those tests to the cluster (now I have to wait a couple of days...), but that will probably improve things and localize the effect of this source of problems.

*The time and allocations of the "preparatory steps" do not compare at all with those of the full benchmark:

julia> @btime CellList($x_min,$box_min);
  754.277 μs (4164 allocations: 7.98 MiB)

julia> @btime CellListMap.AuxThreaded($cl0)
  1.097 ms (5656 allocations: 8.57 MiB)
CellListMap.AuxThreaded{3, Float64}
 Auxiliary arrays for nbatches = 8

Thank you! As far as I understand CellListMap, BenchmarkConfigSweeps might in this case produce a very significant number of plots. I have to admit that what I had in mind was more similar to the first plot @lmiq posted in this thread, however, with a deeper emphasis on in-depth stats and, again, as small a number of plots as possible. As far as I understand it, the case with CellListMap seems to be that, apart from different Julia setups, it is also dependent on the size of the problem (number of particles). If I recall correctly my preliminary tests from a few months ago, the behavior of CellListMap with respect to the number of particles was rather significantly different (at least then; I don’t know if there have been any version changes). Thanks again! I will definitely use it with some of my other projects.

Why so long? :slight_smile:

Simply because the jobs enter the scheduler queue for the use of one of those machines, and the fact that my job takes only a few minutes (and not days) doesn’t matter; it is the same queue.

Yes, quite a lot has changed, but on the master branch (the stable version is v0.6.7). Most of the issues of that other thread were resolved (I explain there), and for that I had to introduce some improvements and new features. The interface will break in the next release (v0.7.0), but the package will be faster and simpler, and allow for unit propagation, automatic differentiation, etc. I am accumulating changes there because that will probably become a 1.0 release.

Let me know if you might need some help with this. In terms of x86, the best I have now in one piece is 128 logical cores. I am able to connect the machines with MPI only, as Distributed somehow does not want to work (I am still trying to resolve it).

1 Like

Preallocating everything, now the scaling makes sense:

image

This is reasonable, and useful if the user will keep using the same buffer (like in a simulation).

But the time for building the data structure becomes very large (much larger than the computing time) as the number of threads increases. Thus, there are probably improvements to be made in the way the data structure is organized and built. Garbage collection can be as much as 100x the computing time for more than 100 threads...

2 Likes

I am glad you like it; however, I have a feeling that following some of @tkf’s potential advice could push the boundaries of time-to-science even further, especially since he seemed quite keen on sharing it (or maybe not, I do not know). BTW, do you also have a chart with deterministic timings related to the sizes of the presented problems, together with specific info about the CPU, and would you be willing to provide it? That could also be interesting and potentially useful.

Good to hear that you get the nice scaling! :tada:

Is the time for building the data structure (= preallocation?) excluded in the scaling plot? Or, if not, do you mean that there is a slowdown compared to the previous implementation?

FWIW, I think manual memory optimization like this should be done only if it’s the bottleneck. All that matters is to get your work done fast and well in the end. Ideally, Julia programmers don’t need to worry about this but unfortunately, we don’t have the investment of the scale of, say, Java or Go in the GC yet.

1 Like

In that plot, yes. It is relevant anyway for applications, because typically the buffer can be reused, such that the initial time for allocating stuff is diluted.

My code has a function to build the data structure from scratch, and another function to update the data structure if new coordinates are given. What I do there is to build the structure (the cell lists) from scratch first, but what I’m benchmarking there is the computation of new cell lists reusing the buffer first created, plus the mapping of the function.

The plot of the OP is a β€œbuild from scratch + mapping” benchmark, which would typically happen only once.

It is even more interesting to decompose the calculation into its individual steps:

Scaling of the computation of the cell lists from scratch (very bad).

Here is where I need to focus if I want to improve this further. Here is also where GC happens. This is less critical than it seems, because normally the second part (the mapping) takes much longer. But that is not true anymore if many cores are used.

image

Updating the cell lists:

This also scales very badly. But it is very fast as well (because the buffers are all preallocated from a previous step, similar to the "build from scratch" above).

image

Clearly, from the two steps above, my strategy for parallelizing the construction of the cell lists is not successful.

Computing the potential

For many threads this becomes so fast that I have only one sample for each run; thus there is noise, and these benchmarks must be improved. But the scaling is quite good in general. There is an expected drop for smaller systems with many threads.

image

For smaller numbers of threads, the times of this third plot are completely dominant, because that is usually the expensive part of the calculation. For a larger number of threads the third part becomes so fast that the other two become relevant and limit good scaling.

Garbage collection

It is late now, thus I may be doing something wrong, but looking at the data, GC explodes for large systems and larger number of threads:

image

Thus, I have to understand what exactly is being collected here (I think the problem is that arrays get moved in memory because they become too big for their initial allocations, thus I have to guess the sizes better to avoid that - actually, that is a doubt I have: does it make sense that if an array, when grown, does not fit in a contiguous chunk of memory, it gets copied somewhere else, and then the GC has to clean up the original memory? If that can happen, that is most likely what is going on here, because I do not have "lost" labels in the code).
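One thing I can try (a sketch of the idea, not the package code): reserve capacity up front with sizehint!, so that push! does not reallocate and leave the old buffers behind for the GC.

particles = Int[]
sizehint!(particles, 2 * 10^6)   # guess an upper bound on the final length
for i in 1:10^6
    push!(particles, i)          # no reallocation while the guess holds
end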

I did notice a wrong data point with negative number of threads, to be checked…

To be more specific: I didn’t change anything relevant in the implementation. The thing is that all allocations and GC occur in the first part, which is relatively fast for small numbers of threads.

However, with many cores, the first part becomes limiting, because the slow part scales really well.

What I did now is to split the first part into two: one where allocations and GC take place, and a second that assumes the buffers are allocated.

This clearly identifies the scaling problems with the allocations and GC of the first step, which makes your very first hints very accurate... and that clearly gives me a path to potentially improve the code.

Thanks for the clarification. So, overall, it sounds like your library scales well already in practice when used with a bit of care?

This is, by the way, why I recommend using base case size (problem size per task) as the algorithm parameter rather than the number of tasks (batches). You’d be able to hide the slowdown in small problems with a large number of threads. It also composes well when your library is used with other libraries with parallel algorithms because your library won’t waste the Julia worker threads by executing parallel tasks with low or negative benefits.
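A tiny sketch of that parameterization (names are mine, not an actual API): expose the base case size and derive the number of batches from it, so small problems get few tasks automatically.

nbatches(n_items; basesize = 25_000) = max(1, cld(n_items, basesize))

nbatches(10^4)   # 1 batch: no spawning overhead for small systems
nbatches(10^7)   # 400 batches: plenty of parallelism for large systems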

Yeah, Julia’s Array is always backed by a contiguous (logical) memory region. So, if you want to avoid copying data when push!ing items, it may be useful to use Deque from DataStructures.jl, or BlockArrays.jl, which provides blockpush!. I’m not sure how much of this contributes to the %GC, though. Also, native iteration on these structures doesn’t work well, so you have to write nested loops explicitly (or use @floop), which increases the complexity of the code. But I think it could be a reasonable optimization.
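A sketch of the Deque option (not tied to CellListMap’s internals): a Deque grows by adding fixed-size blocks, so push! never copies the elements already stored.

using DataStructures

d = Deque{Int}()
for i in 1:10^6
    push!(d, i)        # appends into blocks; no reallocation or copying
end
s = sum(x for x in d)  # iteration works; random-access indexing does not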

2 Likes

It does, but it is frustrating that a test with 4 threads can be faster than one with 128 because of the cost of the initial "warm up". I have to solve that anyway.

Indeed, I think this is the key point. Since you gave me this advice before, I implemented the possibility for the user to set the number of tasks (independently for each part of the calculation). I need a good heuristic now. I found a way to test quickly on 32+ cores, so that is the next step.

I had a hard time trying to use FLoops in this particular case, because I was not able to see how to implement the flexibility I need on top of the FLoops abstraction. I have to reduce complex data structures and map very general user-provided functions, and I couldn’t find a way to do those things.