Spending a lot of time resetting and reducing arrays in a custom simulation

For my own sake (and hopefully for others in the future) I just wanted to write down my final findings. I updated my code to calculate per particle using the approach I described in the post above, and it now performs exactly the same physics as the cell-based neighbor list method I used previously. I see the following results:

                                                   Time                    Allocations
                                           ───────────────────────   ────────────────────────
             Tot / % measured:                   872s /  99.8%           4.03GiB /  95.3%

 Section                           ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────────
 04 First NeighborLoop              12.4k     371s   42.7%  29.9ms   89.9MiB    2.3%  7.42KiB
 08 Second NeighborLoop             12.4k     370s   42.5%  29.8ms   89.9MiB    2.3%  7.42KiB
 02 Calculate IndexCounter            346    48.9s    5.6%   141ms   75.4MiB    1.9%   223KiB
 11 Update To Final TimeStep        12.4k    12.3s    1.4%   990μs     0.00B    0.0%    0.00B
 05 Update To Half TimeStep         12.4k    11.7s    1.3%   945μs     0.00B    0.0%    0.00B
 01 Update TimeStep                 12.4k    11.3s    1.3%   907μs     0.00B    0.0%    0.00B
 Motion                             24.8k    9.95s    1.1%   401μs     0.00B    0.0%    0.00B
 13 Next TimeStep                   12.4k    9.47s    1.1%   763μs   74.1MiB    1.9%  6.12KiB
 ResetArrays                        24.8k    8.31s    1.0%   335μs     0.00B    0.0%    0.00B
 03 Pressure                        24.8k    4.69s    0.5%   189μs    124MiB    3.1%  5.11KiB
 12B Close hdfvtk output files          1    4.08s    0.5%   4.08s    931KiB    0.0%   931KiB
 12A Output Data                      250    2.96s    0.3%  11.8ms   3.40GiB   88.4%  13.9MiB
 10 Final Density                   12.4k    2.42s    0.3%   195μs     0.00B    0.0%    0.00B
 06 Half LimitDensityAtBoundary     12.4k    1.61s    0.2%   130μs     0.00B    0.0%    0.00B
 09 Final LimitDensityAtBoundary    12.4k    1.60s    0.2%   129μs     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────────────

We are ~100 seconds slower. After reflecting on it, this does not come as a huge surprise. The gain of the per-particle method is that I avoid the reduction completely and mostly avoid resetting arrays. The loss is that I perform a lot of redundant calculations: since each particle i stores its neighbors j, every pair (i, j) shows up twice, once from each side. This is why the neighbor-looping time has increased by ~50%, and hence the slowdown.
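
To make the difference concrete, here is a minimal sketch of what I mean (all names, including `pair_force` and `per_particle_accumulate!`, are hypothetical placeholders, not my actual code): each particle only writes to its own slot, so there is no reset and no reduction, but every pair kernel is evaluated twice.

```julia
# Placeholder for the real SPH pair kernel (hypothetical)
pair_force(xi, xj, vi, vj) = -(xi - xj)

function per_particle_accumulate!(dvdt, x, v, neighbor_lists)
    Threads.@threads for i in eachindex(neighbor_lists)
        acc = zero(eltype(dvdt))
        for j in neighbor_lists[i]
            # The same pair physics is evaluated again when the loop
            # reaches particle j and finds i in its neighbor list.
            acc += pair_force(x[i], x[j], v[i], v[j])
        end
        dvdt[i] = acc   # plain store: no @atomic, no reset beforehand
    end
    return dvdt
end
```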

I don’t see a trivial solution to this; if I start to consider the i and j interactions at once, I believe I will end up back at the initial problem of having to perform reductions and resets of arrays to avoid using @atomic locks.
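
For comparison, the pair-wise version I keep circling back to looks roughly like the sketch below (again with hypothetical names, reusing the placeholder `pair_force` from above): each pair is evaluated once and scattered to both particles, which is exactly what forces either `@atomic` updates or per-thread buffers plus a reset and a reduction every step.

```julia
function pairwise_accumulate!(dvdt, x, v, pairs)
    # Per-thread buffers; in a real loop these would be preallocated and
    # reset every step, which is exactly the cost discussed above.
    buffers = [zeros(eltype(dvdt), length(dvdt)) for _ in 1:Threads.nthreads()]
    Threads.@threads :static for k in eachindex(pairs)
        i, j = pairs[k]
        buf = buffers[Threads.threadid()]
        f = pair_force(x[i], x[j], v[i], v[j])
        buf[i] += f   # symmetric update: the pair is only visited once...
        buf[j] -= f   # ...so both i and j must be written here
    end
    fill!(dvdt, zero(eltype(dvdt)))   # the reset
    for buf in buffers                # the reduction
        dvdt .+= buf
    end
    return dvdt
end
```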

I would not say the time was wasted though - going down this tangent I understand better why resetting and reducing arrays is a “necessary evil” for high-performance neighbor looping, and I have gotten some ideas on how to reduce memory bandwidth in the future.

So I will mark this as the “solution” just to finalize the topic and share my experience; perhaps someone else will benefit from it in the future.

If someone finds a very snappy way to reset array values to zero, then I will change the solution to that :blush:
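
For reference, the snappiest resets I know of so far are simply `fill!`, which LLVM typically turns into a memset-like loop for plain bits-type arrays, or a threaded fill for very large arrays; a sketch with hypothetical helper names:

```julia
# Plain reset: fill! is usually hard to beat for bits-type arrays.
reset!(A) = fill!(A, zero(eltype(A)))

# Threaded variant, which may help for arrays large enough that a single
# core cannot saturate memory bandwidth on its own.
function threaded_reset!(A)
    Threads.@threads for i in eachindex(A)
        @inbounds A[i] = zero(eltype(A))
    end
    return A
end
```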
