Indeed, that seems to be an important part of the problem!

The calculation has two parts, constructing the cell lists, and computing the pairwise interactions given the cell lists (similar to what neighbor lists algorithms do with ball trees).

Building the cell lists is of course allocating, but the mapping can be allocation free (depending on the function to be mapped, but in this case it is).

The second part is most times by far the most expensive one. For instance, this is what I get with one thread:

- Time for building the cell lists:

```
julia> t1
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min β¦ max): 3.738 s β¦ 4.139 s β GC (min β¦ max): 3.11% β¦ 5.38%
Time (median): 3.938 s β GC (median): 4.30%
Time (mean Β± Ο): 3.938 s Β± 284.052 ms β GC (mean Β± Ο): 4.30% Β± 1.60%
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
3.74 s Histogram: frequency by time 4.14 s <
Memory estimate: 1.59 GiB, allocs estimate: 100413.
```

- Time for computing the function:

```
julia> t2
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 54.880 s (0.00% GC) to evaluate,
with a memory estimate of 3.80 KiB, over 34 allocations.
```

GC time is, thus, 5% of the time required for building the cell lists, but I was not caring too much about that because the second part is much slower anyway.

The construction of the cell lists is not very easy to parallelize, because the time required for reduction is comparable to the time of the computations, even if using a tree-based asynchronous reduction (sorry if the terminology is nonsense, but I guess you understand what I mean).

I donβt have quick access to the 128-core machine (it takes 3 days for anything start running), but in a 16 core machine I can test things quickly I get:

```
julia> t1
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min β¦ max): 2.196 s β¦ 2.499 s β GC (min β¦ max): 56.57% β¦ 51.40%
Time (median): 2.219 s β GC (median): 55.98%
Time (mean Β± Ο): 2.305 s Β± 168.488 ms β GC (mean Β± Ο): 52.53% Β± 3.54%
β β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
2.2 s Histogram: frequency by time 2.5 s <
Memory estimate: 3.81 GiB, allocs estimate: 1722817.
```

```
julia> t2
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min β¦ max): 3.439 s β¦ 3.618 s β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 3.528 s β GC (median): 0.00%
Time (mean Β± Ο): 3.528 s Β± 126.171 ms β GC (mean Β± Ο): 0.00% Β± 0.00%
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
3.44 s Histogram: frequency by time 3.62 s <
Memory estimate: 55.86 KiB, allocs estimate: 400.
```

The scaling of the second part is almost perfect: 54.88/3.44 = 15.95 (for 16 threads).

The first part has two issues: first, with perfect scaling it should take 0.23s, but even if we discount 52% GC time, that would be about 1s. I donβt know if I can improve that much further.

But with 52% GC time that went to 2s! And that with 16 threads. I guess that GC time with 64-128 threads is explodingβ¦ I will check that, and that might explain at least part of the slowdowns.

If I run with 8 threads, I get for the first part 1.343s and 31%GC time. Thus, if the GC time is increasing in that pace (roughly proportional to the number of threads), one would expect that GC completely dominates the time for the construction of the cell lists and since the other part is fast, the complete calculation.

**So, preliminary conclusion: focus on the effect on GC on the parallelization of the construction of the cell lists.**

Thanks very much @tkf also for the references, they will certainly be helpful. I think the toughest challenge is to come up with an efficient parallel way to construct the cell lists, because the computation is cheap and necessarily require allocating stuff, but it seems that to scale things further that has to be sorted out, because the second part appears to be running fast enough in comparison.

By the way, benchmark code Iβm running is this one:

##
Code

```
using CellListMap using FastPow
using BenchmarkTools
ulj(d2,u) = @fastpow u += 4*(1/d2^6 - 1/d2^3)
function scaling(N=10^6;parallel=true)
nthreads = Threads.nthreads()
GC.gc()
# setup points (irrelevant)
t0 = @benchmark CellListMap.xatomic($N)
x, box = CellListMap.xatomic(N)
# Create cell lists (minor... or not)
t1 = @benchmark CellList($x,$box,parallel=$parallel)
umap(x,y,i,j,d2,u) = ulj(d2,u)
cl = CellList(x,box,parallel=parallel)
# Pairwise computation (expensive)
t2 = @benchmark map_pairwise($umap, 0., $box, $cl, parallel=$parallel)
return t0, t1, t2
end
# the times reported are obtained by running:
t0, t1, t2 = scaling(8*10^6)
# where t1 is the benchmark of the construction of the cell lists and t2 is the time
# required for computing the potential
```

*Iβm using `JULIA_EXCLUSIVE=1`

in these tests.