For my own future reference (and hopefully others'), I just wanted to write down my final findings. I updated my code to calculate neighbor lists per particle using the approach I described in the post above, and it now performs exactly the same physics as the cell-based neighbor list method I used previously. I see the following results:
```
                                                    Time                    Allocations
                                           ───────────────────────   ────────────────────────
              Tot / % measured:                 872s / 99.8%            4.03GiB / 95.3%

 Section                           ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────────
 04 First NeighborLoop              12.4k     371s   42.7%  29.9ms   89.9MiB    2.3%  7.42KiB
 08 Second NeighborLoop             12.4k     370s   42.5%  29.8ms   89.9MiB    2.3%  7.42KiB
 02 Calculate IndexCounter            346    48.9s    5.6%   141ms   75.4MiB    1.9%   223KiB
 11 Update To Final TimeStep        12.4k    12.3s    1.4%   990μs     0.00B    0.0%    0.00B
 05 Update To Half TimeStep         12.4k    11.7s    1.3%   945μs     0.00B    0.0%    0.00B
 01 Update TimeStep                 12.4k    11.3s    1.3%   907μs     0.00B    0.0%    0.00B
 Motion                             24.8k    9.95s    1.1%   401μs     0.00B    0.0%    0.00B
 13 Next TimeStep                   12.4k    9.47s    1.1%   763μs   74.1MiB    1.9%  6.12KiB
 ResetArrays                        24.8k    8.31s    1.0%   335μs     0.00B    0.0%    0.00B
 03 Pressure                        24.8k    4.69s    0.5%   189μs    124MiB    3.1%  5.11KiB
 12B Close hdfvtk output files          1    4.08s    0.5%   4.08s    931KiB    0.0%   931KiB
 12A Output Data                      250    2.96s    0.3%  11.8ms   3.40GiB   88.4%  13.9MiB
 10 Final Density                   12.4k    2.42s    0.3%   195μs     0.00B    0.0%    0.00B
 06 Half LimitDensityAtBoundary     12.4k    1.61s    0.2%   130μs     0.00B    0.0%    0.00B
 09 Final LimitDensityAtBoundary    12.4k    1.60s    0.2%   129μs     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────────────
```
We are ~100 seconds slower. After reflecting on it, this does not come as a huge surprise. The gain of the per-particle method is that I avoid reductions completely and mostly avoid resetting arrays. The loss is that I perform a lot of redundant calculations: since each particle i stores its own list of neighbours j, every pair (i, j) shows up twice, once from each side. This is why the neighbor-looping time has increased by ~50%, and hence the slowdown.
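The doubling of work is easy to see in a toy example (a Python sketch of the idea, not the actual solver code; the function names are illustrative): a full per-particle neighbor list stores every interacting pair twice, while a "half" list stores each pair once.

```python
import numpy as np

def full_neighbor_list(pos, cutoff):
    """Each particle stores all of its neighbors (pairs appear twice)."""
    n = len(pos)
    return [[j for j in range(n)
             if j != i and abs(pos[i] - pos[j]) < cutoff]
            for i in range(n)]

def half_neighbor_list(pos, cutoff):
    """Each pair is stored exactly once (only j > i)."""
    n = len(pos)
    return [[j for j in range(i + 1, n)
             if abs(pos[i] - pos[j]) < cutoff]
            for i in range(n)]

pos = np.linspace(0.0, 1.0, 20)        # 20 particles on a line
full = full_neighbor_list(pos, 0.15)
half = half_neighbor_list(pos, 0.15)

n_full = sum(len(l) for l in full)     # kernel evaluations, full list
n_half = sum(len(l) for l in half)     # kernel evaluations, half list
assert n_full == 2 * n_half            # exactly twice the work
```

The upside of the full list is that each particle's result can be written independently, with no reductions or atomics; the downside is precisely that factor of two in kernel evaluations.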
I don’t see a trivial solution to this. If I start considering the i and j interactions at once, I believe I will end up back at the initial problem of having to perform reductions and array resets in order to avoid @atomic locks.
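For reference, the reduction pattern I keep circling back to looks roughly like this (a Python/NumPy sketch of the concept, not my Julia code; all names here are illustrative): each thread scatters into its own scratch buffer over the pair list, and the buffers are summed afterwards, trading extra memory and a reset per step for atomic-free writes to both i and j.

```python
import numpy as np

def accumulate_pairwise(n_particles, pairs, pair_force, n_threads=4):
    # one scratch buffer per "thread" (thread assignment simulated serially)
    buffers = np.zeros((n_threads, n_particles))
    for k, (i, j) in enumerate(pairs):
        t = k % n_threads              # pretend pair chunks go to threads
        f = pair_force(i, j)
        buffers[t, i] += f             # write to i ...
        buffers[t, j] -= f             # ... and to j; pair visited once
    return buffers.sum(axis=0)         # the reduction step

pairs = [(0, 1), (1, 2), (2, 3)]
forces = accumulate_pairwise(4, pairs, lambda i, j: 1.0)
assert abs(forces.sum()) < 1e-12       # pairwise forces cancel overall
```

The buffers have to be zeroed every step, which is exactly the "resetting arrays" cost the per-particle method was meant to remove.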
I would not say the time was wasted, though. Going down this tangent, I understand better why resetting and reduction of arrays is a “necessary evil” for high-performance neighbor looping, and I have gotten some ideas for how to reduce memory bandwidth in the future.
So I will mark this as the “solution” just to finalize the topic and share my experience; perhaps someone else benefits from it in the future.
If someone finds a very snappy way to reset array values to zero, then I will change the solution to that.
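In case it helps anyone landing here: the usual answer is an in-place fill (a memset under the hood) rather than allocating a fresh zero array each step; in Julia that would be `fill!(arr, 0.0)`. A NumPy sketch of the same idea, just to show the shape of it:

```python
import numpy as np

arr = np.ones(1_000_000)

def reset_by_allocation(n):
    # allocates a brand-new array every call, churning the allocator/GC
    return np.zeros(n)

def reset_in_place(a):
    # overwrites the existing buffer, no allocation
    a.fill(0.0)
    return a

reset_in_place(arr)
assert not arr.any()                   # everything is zero again
```

The in-place variant avoids per-step allocations entirely, which is usually what matters in a tight simulation loop.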