Edit: the benchmarks I originally discussed here were wrong, because of incorrect use of variable interpolation with @benchmark, so I have removed them.
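For context, the pitfall was of this general form (a minimal sketch with made-up data, not the original benchmark): without `$`-interpolation, `@benchmark` times access to a non-constant global as part of the measured expression.
julia> using BenchmarkTools
julia> v = rand(10^6);
julia> @benchmark sum(v)   # `v` is a non-constant global; the dynamic lookup is included in the timing
julia> @benchmark sum($v)  # `$v` interpolates the value, so only `sum` itself is measured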
Again thanks for the tips.
I cannot really preallocate everything exactly in advance, because the particles are replicated across the boundaries as ghost cells, and I cannot know beforehand how many ghost particles there will be.
The good thing is that I already provide means to reuse a previously allocated cell list (sketched below), such that allocations are zero if the coordinates do not change (and minimal if they do, just to adapt to possible variations). Edit: I remember now that I have tried preallocation strategies, but the computation is so cheap here that doing anything extra introduces an overhead. The cost of GC is the new information here.
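The reuse pattern amounts to building the structures once and updating them in place at every step (a hypothetical loop, assuming the same CellList/AuxThreaded/UpdateCellList! API used later in this post):
using CellListMap
x, box = CellListMap.xatomic(10^5)  # generate a test system
cl = CellList(x, box)               # build the cell list once
aux = CellListMap.AuxThreaded(cl)   # preallocate auxiliary arrays for threading
for step in 1:100
    # ... update the coordinates in x in place ...
    cl = UpdateCellList!(x, box, cl, aux) # reuses the existing storage
end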
Most interestingly, I am figuring out now that I can use that feature not to preallocate, but to keep the arrays “alive”, thus preventing them from being garbage-collected, using:
x0, box0 = CellListMap.xatomic(5000) # small system, very fast
cl = CellList(x0,box0) # build cell lists for the small system
aux = CellListMap.AuxThreaded(cl) # preallocate auxiliary arrays for cell lists
x, box = CellListMap.xatomic(10^7) # much larger system
cl = UpdateCellList!(x,box,cl,aux) # build cell lists for the large system
Although the small and large systems are very different, and there will be a lot of allocations in the cell list update, the fact that the new structures reuse the previous ones, which come from an outer scope, prevents them from being garbage-collected.
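The same principle can be seen with plain arrays (a generic illustration, not CellListMap-specific): as long as a buffer is referenced from an outer scope it survives garbage collection, and it only allocates when it actually has to grow.
buf = Float64[]                        # buffer kept alive by the outer-scope reference
function use_buffer!(buf, n)
    length(buf) < n && resize!(buf, n) # allocates only when the buffer must grow
    fill!(buf, 0.0)
    return sum(buf)
end
use_buffer!(buf, 10^6) # first call grows the buffer
use_buffer!(buf, 10^3) # later calls with smaller n allocate nothing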
So, if we had before:
julia> @benchmark CellList($x,$box)
BenchmarkTools.Trial: 30 samples with 1 evaluation.
Range (min … max): 116.259 ms … 305.339 ms ┊ GC (min … max): 0.00% … 43.32%
Time (median): 151.771 ms ┊ GC (median): 0.00%
Time (mean ± σ): 167.708 ms ± 53.939 ms ┊ GC (mean ± σ): 9.80% ± 15.52%
▂ █
█▅█▁▅▁█▁█▅▁▁▁███▅▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▁▁▁▁▅▁▁▁▁▁▁▁▅▅▁▁▁▁▁▁▁▁▅ ▁
116 ms Histogram: frequency by time 305 ms <
Memory estimate: 404.42 MiB, allocs estimate: 121185.
now we have:
julia> x_min, box_min = CellListMap.xatomic(5000);
julia> cl0 = CellList(x_min,box_min);
julia> aux0 = CellListMap.AuxThreaded(cl0);
julia> x, box = CellListMap.xatomic(10^6);
julia> @benchmark UpdateCellList!($x,$box,cl,aux) setup=(cl=deepcopy(cl0),aux=deepcopy(aux0)) evals=1
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 100.982 ms … 111.468 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 104.191 ms ┊ GC (median): 0.00%
Time (mean ± σ): 104.652 ms ± 2.111 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁▁ ▁▁ ▄█ █▄ ▁
▆▁▁▁▁▆▆▁█▆▆██▆██▆██▆▆▆▁▆██▁█▆▆▆▁▁▁▁▆▁▁▁▁▁▁▁▁▆▁▁▁▆▁▁▁▁▆▁▁▁▁▁▁▆ ▁
101 ms Histogram: frequency by time 111 ms <
Memory estimate: 13.05 KiB, allocs estimate: 156.
I don’t really understand how allocations are being counted here, because the results of both processes are the same, yet what is reported as allocations is very different*. But now I can see how these things go without the garbage collection. I have sent those tests to the cluster (now I have to wait a couple of days…), but this will probably improve things and localize the effect of this source of problems.
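If I understand BenchmarkTools correctly, code in `setup` runs outside the timed region, so the allocations of the `deepcopy` calls are not included in the reported estimate. A minimal illustration of that behavior:
julia> # the ~7.6 MiB allocated by zeros(10^6) in setup should not appear in the memory estimate
julia> @benchmark fill!(v, 0.0) setup=(v = zeros(10^6)) evals=1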
*The time and allocations of the “preparatory steps” do not compare at all with those of the full benchmark:
julia> @btime CellList($x_min,$box_min);
754.277 μs (4164 allocations: 7.98 MiB)
julia> @btime CellListMap.AuxThreaded($cl0)
1.097 ms (5656 allocations: 8.57 MiB)
CellListMap.AuxThreaded{3, Float64}
Auxiliary arrays for nbatches = 8