Note that most of your code can be written in a dimension-independent manner. But, in particular, change
to:
particle.f = particle.f + SVector{2,Float64}(f_x,f_y)
and that will remove the allocation of the [f_x,f_y]
intermediate vector. Removing this allocations will speedup the code in general, and may have implications for the scaling at the end.
To improve scaling you will need to improve the distribution of tasks, for example by splitting the tasks in nthreads
chuncks of similar size. I have done a cell list implementation that does that reasonably here: https://github.com/m3g/CellListMap.jl/blob/1629b7ac8c621d002eeb6294385f65f9c8d4eb17/src/CellListMap.jl#L740. But that is not ideal yet, probably.
@Skoffer gave a lecture on how to that properly here: Nbabel nbody integrator speed up - #32 by Skoffer