I am using TimerOutputs.jl to time the performance of specific components, which, to be clear, is really awesome. It has made me aware of how much time I actually spend setting values back to zero:
Looking at 03B ResetArrays and 07B ResetArrays, it is clear that I am spending over 1 minute of a 12-minute simulation setting values back to zero. For the reductions, 04 Reduction and 08 Reduction, it also adds up to about a minute. So in total 1/6th of my simulation time is spent resetting arrays to zero and reducing arrays to get the final answer.
The reason I have to reset is that my loop accumulates simulation values with += each time step, and the reason for the reduction is the multi-threaded approach using ChunkSplitters.jl.
It is hard to provide an MWE since it is really nested code in a package I am cleaning up, but I am posting here to hear from others who have also written simulations in Julia.
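Roughly, the pattern per time step looks like this (a minimal sketch with made-up names, not the actual code; ChunkSplitters.jl provides the chunking in the real thing):

```julia
using Base.Threads

# Each task accumulates += into its own full-size buffer, because a pair (i, j)
# writes to both entries i and j; afterwards the buffers are reduced into the
# final array and everything is zeroed again for the next time step.
function step!(acc, buffers, pairs, chunk_ranges)
    @threads for c in eachindex(chunk_ranges)
        buf = buffers[c]                     # scratch array owned by this task
        for k in chunk_ranges[c]
            i, j = pairs[k]
            f = 1.0                          # placeholder for the actual pair physics
            buf[i] += f
            buf[j] -= f
        end
    end
    for buf in buffers                       # this is the "Reduction" timer
        acc .+= buf
    end
    foreach(b -> fill!(b, zero(eltype(b))), buffers)   # and this the "ResetArrays" timer
    return acc
end

# Usage sketch: one full-length buffer per chunk of the pair list
pairs        = [(rand(1:100), rand(1:100)) for _ in 1:10_000]
chunk_ranges = collect(Iterators.partition(eachindex(pairs), cld(length(pairs), nthreads())))
buffers      = [zeros(100) for _ in chunk_ranges]
acc          = zeros(100)
step!(acc, buffers, pairs, chunk_ranges)
```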
fill! is quite efficient on arrays, so it’s not easy to speed it up as such. It could be that you hit the memory bandwidth limit. However, I don’t quite understand why your ResetArrays! seems to allocate. It’s not much (around 48 bytes per call), but it shouldn’t allocate at all?
Have you tried multithreading the ResetArrays!? The simplest change would be to exchange foreach with tforeach from OhMyThreads.jl (see docs here).
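i.e. something along these lines (just a sketch - I'm guessing that `ResetArrays!` essentially `fill!`s a collection of arrays):

```julia
using OhMyThreads: tforeach

# guessed shape of ResetArrays!: zero each buffer on its own task
function ResetArrays!(arrays)
    tforeach(arrays) do a
        fill!(a, zero(eltype(a)))
    end
end
```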
Can you “pre-reduce” each chunk? I mean, can each thread already perform some of the reduction so that the main thread needs to do less? That could be more efficient.
@sgaure I am a bit unsure too; it might be a run-time dispatch that is being captured. I’ve noticed that depending on the simulation length I sometimes see exactly 0 bytes.
@abraemer yes, multi-threading did not seem to help unfortunately.
I’ve updated the way I account for the timings to see total time spent in reduction and resetting arrays:
Which is ~170 s in total, so about 1/4 of the total simulation time. It feels a bit annoying that I have to spend so much on “bookkeeping”, but each of these functions itself takes only ~1 ms per call, so it is also a bit tough to optimize. Only reducing the number of calls seems feasible.
@abraemer what do you mean by pre-reduce? I split my vector into nthreads pieces and then run a threaded reduction (+) afterwards.
Then you probably already do what I suggested. To me it sounded like the reduction step uses the main thread to combine the results from the different threads, so my suggestion was to try to redistribute the reduction to the threads as much as possible, such that the main thread needs to do less work overall.
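In code, what I had in mind was roughly this (a sketch with made-up names): parallelize over slices of the output instead of looping over the buffers on the main thread, so that no two tasks touch the same indices:

```julia
using Base.Threads

function threaded_reduce!(acc, buffers)
    n = length(acc)
    ranges = collect(Iterators.partition(1:n, cld(n, nthreads())))
    @threads for r in ranges
        for buf in buffers
            @views acc[r] .+= buf[r]   # each task sums all buffers over its own slice
        end
    end
    return acc
end
```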
@Dan, your suggestion finally made me pull myself up by the bootstraps, and together with ChatGPT I managed to whip something together in which the reduction and reset are minimized heavily, since I now output a different kind of particle interaction list. Before, I would output it from the perspective of grid cells; now, for each particle i, I write out its neighbours exactly. The benefit is clearly visible:
The neighbor looping seems to have gotten faster, the calculation of the neighbor list (indexcounter) has gotten slower, and some physics are still missing, but it seems successful enough. The improved memory access does cut time down, especially considering I whipped this out in a day while the other code I have been fine-tuning for over a month - so I think there is potential.
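For context, the per-particle variant is roughly of this shape (a sketch with hypothetical names, not the actual code): each task writes only to `acc[i]` for its own particles, so there are no per-thread buffers, no reduction, and essentially nothing to reset:

```julia
using Base.Threads

function per_particle_step!(acc, neighbors, positions)
    @threads for i in eachindex(neighbors)
        s = zero(eltype(acc))
        for j in neighbors[i]
            # placeholder for the actual pair physics; note the pair (i, j) is
            # visited again as (j, i) when particle j is processed
            s += 1 / (1 + abs(positions[i] - positions[j]))
        end
        acc[i] = s    # plain assignment: no += across tasks, so no reduction/reset
    end
    return acc
end
```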
It is not a complete win yet; smaller simulations (where resetting and reduction are not the driving cost) have gotten slower, so I will investigate more.
At the end of the day, I think the main lesson I’ve learned from asking this question is the importance of minimizing array accesses whenever you can and always using a profiler to check performance.
For my own reference (and hopefully for others in the future) I just wanted to write down my final findings. I updated my code to calculate per particle using the approach I described in the post above, now performing exactly the same physics as the cell-based neighbor list method I used previously. I see the following results:
We are ~100 seconds slower. After reflecting on it, this does not come as a huge surprise. The gain of the per-particle method is that I avoid reduction completely and mostly avoid resetting arrays. The loss is that I perform a lot of redundant calculations: since for each particle i its neighbours j are stored, the pair (i, j) shows up repeatedly. This is why the neighbor looping time has increased by 50%, and hence the slowdown.
I don’t see a trivial solution to this; if I start to consider the i and j interactions at once, I believe I will end up back at the same initial problem of having to perform reductions and resets of arrays in order to avoid @atomic operations.
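For completeness, the alternative I mean - visiting each pair once and scattering the result to both particles - would look roughly like this (a sketch; I believe Atomix.jl is one way to get atomic array updates on the CPU, but I have not benchmarked it):

```julia
using Base.Threads
using Atomix

function symmetric_step!(acc, pairs, positions)
    @threads for k in eachindex(pairs)
        i, j = pairs[k]
        f = 1 / (1 + abs(positions[i] - positions[j]))   # placeholder physics
        Atomix.@atomic acc[i] += f   # two tasks may hit the same particle,
        Atomix.@atomic acc[j] += f   # hence atomics (or buffers + reduction)
    end
    return acc
end
```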
I would not say the time was wasted though - going down this tangent, I understand better why resetting and reduction of arrays is a “necessary evil” for high-performance neighbor looping, and I have gotten some ideas on how to reduce memory bandwidth in the future.
So I will mark this as the “solution” just to finalize the topic and share my experience; perhaps someone else benefits in the future.
If someone finds a very snappy way to reset array values to zero, then I will change the solution to that.