In that plot, yes. It is still relevant for applications, because the buffer can typically be reused, so the initial allocation time is diluted.
My code has one function that builds the data structure from scratch, and another that updates it when new coordinates are given. There, I first build the structure (the cell lists) from scratch, but what I’m benchmarking is the computation of new cell lists reusing the buffers created in that first step, plus the mapping of the function.
The plot of the OP is a “build from scratch + mapping” benchmark, which would typically happen only once.
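To make the difference between "build from scratch" and "update" concrete, here is a minimal sketch using the classic linked-cell layout in 2D. It is not the actual implementation discussed here: the names `CellList`, `build_cell_list`, `update_cell_list` and `particles_in_cell` are hypothetical, and the coordinates are assumed to lie in `[0, side)`.

```python
class CellList:
    """Linked-cell storage: head[c] is the last particle put in cell c,
    nxt[i] is the particle that was put in the same cell before i (-1 = none)."""

    def __init__(self, n_points, n_cells_per_side):
        self.n = n_cells_per_side
        self.head = [-1] * (n_cells_per_side ** 2)  # allocated once
        self.nxt = [-1] * n_points                  # allocated once


def update_cell_list(cl, points, side):
    """Recompute the cell lists for new coordinates, reusing the buffers of cl."""
    if len(points) > len(cl.nxt):
        # Grow only if the number of particles increased.
        cl.nxt.extend([-1] * (len(points) - len(cl.nxt)))
    for c in range(len(cl.head)):
        cl.head[c] = -1  # reset in place, no new allocation
    n = cl.n
    for i, (x, y) in enumerate(points):
        ix = min(int(x / side * n), n - 1)
        iy = min(int(y / side * n), n - 1)
        c = ix + n * iy
        cl.nxt[i] = cl.head[c]
        cl.head[c] = i
    return cl


def build_cell_list(points, side, cutoff):
    """Build from scratch: allocate the buffers, then fill them."""
    n = max(1, int(side / cutoff))
    cl = CellList(len(points), n)
    return update_cell_list(cl, points, side)


def particles_in_cell(cl, c):
    """Walk the linked list of cell c and collect the particle indices."""
    out = []
    i = cl.head[c]
    while i != -1:
        out.append(i)
        i = cl.nxt[i]
    return out
```

In this sketch the allocation cost sits entirely in `build_cell_list`. A benchmark that calls `update_cell_list` on an existing structure only measures the refill of buffers that already exist.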
It is even more interesting to decompose the calculation into its steps:
Scaling of the computation of the cell lists from scratch (very bad).
Here is where I need to focus if I want to improve this further. Here is also where GC happens. This is less critical than it seems, because normally the second part (the mapping) takes much longer. But that is not true anymore if many cores are used.
Updating the cell lists:
This also scales very badly, but it is very fast as well, because the buffers are all preallocated in a previous step similar to the “build from scratch” above.
Clearly, from the two steps above, my strategy for parallelizing the construction of the cell lists is not successful.
Computing the potential
With many threads this becomes fast, and I have only one sample for each run, so there is noise; these benchmarks must be improved. But the scaling is quite good in general. There is an expected drop for smaller systems with many threads.
For smaller numbers of threads, the times of this third plot are completely dominant, because that is usually the expensive part of the calculation. For larger numbers of threads, the third step becomes so fast that the other two become relevant and limit the scaling.
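The effect described above is essentially Amdahl's law. As a rough sketch, assume the cell list construction does not scale at all and the mapping scales perfectly with the number of threads. Both are idealizations, and the numbers used below are purely illustrative, not taken from these benchmarks.

```python
def total_time(t_build, t_map, n_threads):
    """Idealized model: the build time is constant, the mapping time is divided by n_threads."""
    return t_build + t_map / n_threads


def speedup(t_build, t_map, n_threads):
    """Speedup of the whole calculation relative to a single thread."""
    return total_time(t_build, t_map, 1) / total_time(t_build, t_map, n_threads)


def build_fraction(t_build, t_map, n_threads):
    """Fraction of the total time spent building the cell lists."""
    return t_build / total_time(t_build, t_map, n_threads)


# Illustrative numbers only: the build takes 1 time unit, the mapping 99.
for n in (1, 4, 16, 64):
    print(n, round(speedup(1.0, 99.0, n), 2), round(build_fraction(1.0, 99.0, n), 3))
```

Under these assumptions a step that takes 1% of the single-threaded time grows to a large share of the total once the mapping is spread over many threads, which is why the first two plots start to matter.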
Garbage collection
It is late now, so I may be doing something wrong, but looking at the data, GC explodes for large systems and large numbers of threads:
Thus, I have to understand what exactly is being collected here. I think the problem is that arrays get moved in memory because they become too big for their initial allocations, so I have to guess the sizes better to avoid that. I am not even sure about this: does it make sense that if an array, when grown, does not fit in a contiguous chunk of memory, it is copied somewhere else, and then GC has to clean up the original memory? If that can happen, it is most likely what is going on here, because I do not have “lost” labels in the code.
I did notice a wrong data point with a negative number of threads, to be checked…
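Growable arrays generally do behave this way: when the capacity is exceeded and the block cannot be extended in place, a larger block is allocated, the contents are copied, and the old block becomes garbage. The toy model below only counts such events. It assumes a capacity-doubling policy, which is an assumption and not necessarily what the actual runtime does, and `GrowableBuffer` and `fill` are hypothetical names.

```python
class GrowableBuffer:
    """Toy model of a dynamic array that doubles its capacity when full."""

    def __init__(self, capacity=1):
        self.capacity = max(1, capacity)
        self.length = 0
        self.reallocations = 0    # number of times the storage was replaced
        self.discarded_slots = 0  # total size of the blocks left behind

    def push(self):
        if self.length == self.capacity:
            # The old block is abandoned (work for the GC) and a block
            # twice as large takes its place.
            self.discarded_slots += self.capacity
            self.capacity *= 2
            self.reallocations += 1
        self.length += 1


def fill(n_items, initial_capacity):
    """Push n_items elements into a buffer with the given initial capacity."""
    buf = GrowableBuffer(initial_capacity)
    for _ in range(n_items):
        buf.push()
    return buf


small_guess = fill(1000, 1)    # initial size guessed far too small
good_guess = fill(1000, 1000)  # initial size guessed correctly
print(small_guess.reallocations, small_guess.discarded_slots)
print(good_guess.reallocations, good_guess.discarded_slots)
```

In this model a good initial guess removes the reallocations completely. If each thread grows its own buffers this way, the discarded blocks would add up with the number of threads, which would be consistent with the hypothesis above but does not prove it.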



