Again on reaching optimal parallel scaling

edit: the benchmarks I discussed here were wrong, because of incorrect use of variable interpolation with @benchmark.
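For reference, the usual caveat is that globals passed to @benchmark have to be $-interpolated, otherwise the expression also measures access to untyped globals and the reported time and allocations get distorted. A minimal sketch (not necessarily the exact mistake I made):

using BenchmarkTools

v = rand(10^6)      # non-const global

@benchmark sum(v)   # also measures access to an untyped global
@benchmark sum($v)  # interpolation: only sum itself is measured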


Again thanks for the tips.

I cannot really preallocate everything exactly in advance, because the particles are shadowed into the boundaries to populate the ghost cells, and I cannot know in advance how many ghost particles there will be.
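Just to illustrate why that number is data-dependent, here is a toy 1D sketch (not the actual CellListMap implementation): particles closer to a periodic boundary than the cutoff have to be replicated on the other side, and how many of them there are depends on the coordinates themselves.

# Toy example: particles within `cutoff` of a periodic boundary must be
# replicated ("shadowed") on the other side as ghost particles. The count
# depends on the coordinates, so it cannot be known in advance.
function count_ghosts_1d(x::Vector{Float64}, box_length::Float64, cutoff::Float64)
    return count(xi -> xi < cutoff || xi > box_length - cutoff, x)
end

x = rand(1000) .* 10.0          # random positions in a box of length 10
count_ghosts_1d(x, 10.0, 1.0)   # varies from one configuration to another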

But the good thing is that I already provide a way to reuse a previously allocated cell list, such that allocations are zero if the coordinates do not change (and minimal if they do, just to adapt to possible variations). Edit: I remember now that I have tried preallocation strategies, but the computation here is so cheap that doing anything extra introduces an overhead. The cost of GC is new information to me here.
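A sketch of that reuse pattern, using the same constructors that appear below (the function name run_steps! and the system size are just for illustration):

using CellListMap

function run_steps!(x, box, cl, aux; nsteps=10)
    for step in 1:nsteps
        # ... update the coordinates in x in place here ...
        cl = UpdateCellList!(x, box, cl, aux)  # reuses the existing structures:
                                               # zero allocations if the coordinates
                                               # did not change, minimal otherwise
    end
    return cl
end

x, box = CellListMap.xatomic(10^5)
cl = CellList(x, box)              # allocated once
aux = CellListMap.AuxThreaded(cl)  # auxiliary arrays, allocated once
cl = run_steps!(x, box, cl, aux)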

Most interestingly, I am figuring out now that I can use that feature not to preallocate, but to keep the arrays “alive”, and thus not garbage collected, using:

x0, box0 = CellListMap.xatomic(5000) # small system, very fast
cl = CellList(x0,box0) # build cell lists for the small system
aux = CellListMap.AuxThreaded(cl) # preallocate auxiliary arrays for cell lists
x, box = CellListMap.xatomic(10^7) # much larger system
cl = UpdateCellList!(x,box,cl,aux) # build cell lists for the large system

Although the small and large systems are very different, and there will be many allocations in the cell list update, the fact that the updated structures reuse the previous ones, which come from an outer scope, prevents them from being garbage collected.
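The same effect can be reproduced with plain arrays, independently of CellListMap: a buffer that is still referenced from an outer scope is reused and never becomes garbage, while a temporary created inside the measured code has to be collected eventually. A minimal sketch:

using BenchmarkTools
using Random: rand!

# Fresh allocation on every call: the temporary vector becomes garbage,
# so the GC eventually shows up in the timings.
work_alloc(n) = sum(abs2, rand(n))

# The buffer is kept alive in an outer scope: it is reused and never collected.
function work_reuse!(buf)
    rand!(buf)
    return sum(abs2, buf)
end

buf = Vector{Float64}(undef, 10^6)

@benchmark work_alloc(10^6)    # ~8 MB allocated per call
@benchmark work_reuse!($buf)   # essentially allocation-free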

So, if we had before:

julia> @benchmark CellList($x,$box)
BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  116.259 ms … 305.339 ms  ┊ GC (min … max): 0.00% … 43.32%
 Time  (median):     151.771 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   167.708 ms ±  53.939 ms  ┊ GC (mean ± σ):  9.80% ± 15.52%

  ▂ █                                                            
  █▅█▁▅▁█▁█▅▁▁▁███▅▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▁▁▁▁▅▁▁▁▁▁▁▁▅▅▁▁▁▁▁▁▁▁▅ ▁
  116 ms           Histogram: frequency by time          305 ms <

 Memory estimate: 404.42 MiB, allocs estimate: 121185.

now we have:

julia> x_min, box_min = CellListMap.xatomic(5000);

julia> cl0 = CellList(x_min,box_min);

julia> aux0 = CellListMap.AuxThreaded(cl0);

julia> x, box = CellListMap.xatomic(10^6);

julia> @benchmark UpdateCellList!($x,$box,cl,aux) setup=(cl=deepcopy(cl0),aux=deepcopy(aux0)) evals=1
BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):  100.982 ms … 111.468 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     104.191 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   104.652 ms ±   2.111 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁  ▁▁ ▁▁ ▄█     █▄ ▁                                   
  ▆▁▁▁▁▆▆▁█▆▆██▆██▆██▆▆▆▁▆██▁█▆▆▆▁▁▁▁▆▁▁▁▁▁▁▁▁▆▁▁▁▆▁▁▁▁▆▁▁▁▁▁▁▆ ▁
  101 ms           Histogram: frequency by time          111 ms <

 Memory estimate: 13.05 KiB, allocs estimate: 156.

I don’t really understand how allocations are being counted here, because the results of both processes are the same, yet what is reported as allocations is very different*. But now I can see how these things go without the garbage collection. I have sent those tests to the cluster (now I have to wait a couple of days…), but this will probably improve things and help localize the effect of this source of problems.

*The time and allocations of the “preparatory steps” are nowhere near those of the full benchmark, so they do not account for the difference:

julia> @btime CellList($x_min,$box_min);
  754.277 μs (4164 allocations: 7.98 MiB)

julia> @btime CellListMap.AuxThreaded($cl0)
  1.097 ms (5656 allocations: 8.57 MiB)
CellListMap.AuxThreaded{3, Float64}
 Auxiliary arrays for nbatches = 8
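As far as I understand, the setup= phase of @benchmark is excluded from the measurement, so the deepcopy calls above do not enter the reported 13.05 KiB; only what UpdateCellList! itself allocates during the measured evaluation is counted. A minimal sketch of that behavior:

using BenchmarkTools

# setup= runs before each measured evaluation but is *not* included in the
# reported time or allocations: this reports ~0 bytes, even though each
# sample's setup allocates an 8 MB vector.
@benchmark sum(v) setup=(v = rand(10^6)) evals=1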