Again on reaching optimal parallel scaling

lmiq · December 18, 2021, 12:30am

In that plot, yes. It is relevant anyway for applications, because typically the buffer can be reused, so that the initial allocation time is diluted over many runs.

My code has a function to build the data structure from scratch, and another function to update the data structure if new coordinates are given. What I do there is to build the structure (the cell lists) from scratch first, but what I’m benchmarking there is the computation of new cell lists reusing the buffer first created, plus the mapping of the function.
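To make the distinction concrete, here is a minimal sketch of the "build from scratch" versus "update reusing the buffers" pattern. It is not the actual code being benchmarked: the type `CellList1D`, the functions `build`, `update!` and `cell_index`, and the one-dimensional setting are all hypothetical and chosen only to keep the example short.

```julia
# Hypothetical 1D cell list: `cells[i]` holds the indices of the points in cell i.
struct CellList1D
    cutoff::Float64
    cells::Vector{Vector{Int}}
end

# Cell index of coordinate `x` (assumes 0 <= x < ncells * cutoff).
cell_index(x, cutoff) = floor(Int, x / cutoff) + 1

# Build from scratch: allocates all the buffers, then fills them.
function build(x::Vector{Float64}, cutoff::Float64, ncells::Int)
    cl = CellList1D(cutoff, [Int[] for _ in 1:ncells])
    return update!(cl, x)
end

# Update with new coordinates: empties the cells and refills them.
# `empty!` sets the length to zero; the memory already reserved by each
# vector is normally kept, so refilling needs few or no new allocations.
function update!(cl::CellList1D, x::Vector{Float64})
    foreach(empty!, cl.cells)
    for (i, xi) in pairs(x)
        push!(cl.cells[cell_index(xi, cl.cutoff)], i)
    end
    return cl
end

x = [0.1, 0.7, 1.2, 2.9]
cl = build(x, 1.0, 3)               # cells: [1, 2], [3], [4]
update!(cl, [2.5, 0.2, 0.3, 1.1])   # cells: [2, 3], [4], [1]
```

In this sketch only `build` pays for the allocation of the per-cell vectors; repeated calls to `update!` reuse them, which is the case being timed in the benchmarks that reuse the buffer.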

The plot of the OP is a “build from scratch + mapping” benchmark, which would typically happen only once.

It is even more interesting to decompose the calculation into its steps:

Scaling of the computation of the cell list from scratch (very bad).

Here is where I need to focus if I want to improve this further. Here is also where GC happens. This is less critical than it seems, because normally the second part (the mapping) takes much longer. But that is not true anymore if many cores are used.

Updating the cell lists:

This also scales very badly. But it is very fast as well, because the buffers are all preallocated in a previous step, similar to the “build from scratch” above.

Clearly, from the two steps above, my strategy for parallelizing the construction of the cell lists is not successful.

Computing the potential

For many threads this becomes very fast, and I have only one sample for each run, so there is noise; these benchmarks must be improved. But the scaling is quite good in general. There is an expected drop for smaller systems with many threads.
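One simple way to reduce the noise of single-sample timings is to repeat each measurement and keep the minimum. The sketch below only illustrates that idea; `min_time` and `work` are hypothetical names, and `work` is just a stand-in for the potential computation. BenchmarkTools.jl does the same kind of thing more carefully.

```julia
# Run `f` several times and return the smallest elapsed time (in seconds).
function min_time(f; samples = 5)
    f() # warm-up call, so that compilation is not included in the timings
    best = Inf
    for _ in 1:samples
        best = min(best, @elapsed(f()))
    end
    return best
end

# Hypothetical stand-in for the expensive part of the calculation.
work() = sum(sqrt(i) for i in 1:10^6)

t = min_time(work; samples = 10)
println("minimum time over 10 samples: ", t, " s")
```

The minimum is used rather than the mean because noise from the operating system or from GC pauses can only make a run slower, never faster.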

For a smaller number of threads, the times of this third plot are completely dominant, because that is usually the expensive part of the calculation. For a larger number of threads, the third step becomes so fast that the other two become relevant and limit the scaling.
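This is the usual Amdahl's law picture. As a rough sketch, assume that a fraction s of the single-thread time (here, building and updating the cell lists) does not scale at all, and that the rest (the mapping) scales ideally over n threads. The symbols s and n are introduced here only for illustration, and the numbers below are not measurements from these benchmarks.

```latex
S(n) = \frac{T(1)}{T(n)} = \frac{1}{s + \dfrac{1 - s}{n}} \le \frac{1}{s}
```

For example, with s = 0.05 and n = 64 the speedup would be about 1 / (0.05 + 0.95/64) ≈ 15.4, and it could never exceed 20, no matter how well the mapping itself scales.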

Garbage collection

It is late now, thus I may be doing something wrong, but looking at the data, GC explodes for large systems and larger numbers of threads:

Thus, I have to understand what exactly is being collected here. I think the problem is that arrays get moved in memory because they become too big for their initial allocations, so I have to guess the sizes better to avoid that. That is actually still a question for me: does it make sense that if an array, when increased, does not fit in a contiguous chunk of memory, it will be copied somewhere else, and then the GC has to clean the original memory? If that can happen, it is most likely what is going on here, because I do not have “lost” labels in the code.
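A small experiment can help check this hypothesis. The sketch below compares the memory allocated when a vector grows through repeated `push!` with the case where its capacity is reserved up front with `sizehint!`. The function `fill_vector` and its `hint` keyword are hypothetical names for this illustration only; the actual numbers depend on the Julia version, so none are shown.

```julia
# Fill a vector with n elements, optionally reserving capacity first.
function fill_vector(n; hint = 0)
    v = Int[]
    hint > 0 && sizehint!(v, hint) # reserve capacity for `hint` elements
    for i in 1:n
        push!(v, i) # may trigger a regrowth when the capacity is exceeded
    end
    return v
end

n = 10^6
fill_vector(10); fill_vector(10; hint = 10) # warm-up, to exclude compilation

bytes_grow = @allocated fill_vector(n)           # capacity guessed by push!
bytes_hint = @allocated fill_vector(n; hint = n) # capacity reserved up front
println("without sizehint!: ", bytes_grow, " bytes")
println("with sizehint!:    ", bytes_hint, " bytes")
```

If growing arrays are indeed being reallocated and copied, leaving the old buffers for the GC, the first number should come out larger than the second. In that case, better size guesses (or a `sizehint!` on the per-thread buffers) would be the place to start.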

I did notice a wrong data point with a negative number of threads, to be checked…
