I am using TimerOutputs.jl to time the performance of specific components, which, to be clear, is really awesome. It has made me aware of how much time I actually spend setting values back to zero:
Looking at 03B ResetArrays and 07B ResetArrays, it is clear that I am spending over 1 minute of a 12-minute simulation setting values back to zero. For the reductions, 04 Reduction and 08 Reduction, it also adds up to about a minute. So in total 1/6th of my simulation time is spent resetting arrays to zero and reducing arrays to get the final answer.
The reason I have to reset is that my loop accumulates simulation values with += each time step, and the reason for the reduction is the multi-threaded approach using ChunkSplitters.jl.
It is hard to provide an MWE since it is really nested code in a package I am cleaning up, but I am posting here to hear from others who have also written simulations in Julia.
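Roughly, the pattern per time step looks like this (a minimal sketch with made-up names, not the actual code; ChunkSplitters.jl provides the chunking in the real thing):

```julia
using Base.Threads

# Each task accumulates += into its own full-size buffer, because a pair (i, j)
# writes to both entries i and j; afterwards the buffers are reduced into the
# final array and everything is zeroed again for the next time step.
function step!(acc, buffers, pairs, chunk_ranges)
    @threads for c in eachindex(chunk_ranges)
        buf = buffers[c]                     # scratch array owned by this task
        for k in chunk_ranges[c]
            i, j = pairs[k]
            f = 1.0                          # placeholder for the actual pair physics
            buf[i] += f
            buf[j] -= f
        end
    end
    for buf in buffers                       # this is the "Reduction" timer
        acc .+= buf
    end
    foreach(b -> fill!(b, zero(eltype(b))), buffers)   # and this the "ResetArrays" timer
    return acc
end

# Usage sketch: one full-length buffer per chunk of the pair list
pairs        = [(rand(1:100), rand(1:100)) for _ in 1:10_000]
chunk_ranges = collect(Iterators.partition(eachindex(pairs), cld(length(pairs), nthreads())))
buffers      = [zeros(100) for _ in chunk_ranges]
acc          = zeros(100)
step!(acc, buffers, pairs, chunk_ranges)
```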
fill! is quite efficient on arrays, so it’s not easy to speed it up as such. It could be that you hit the memory bandwidth limit. However, I don’t quite understand why your ResetArrays! seems to allocate. It’s not much (around 48 bytes per call), but it shouldn’t allocate at all?
Have you tried multithreading the ResetArrays!? The simplest change would be to exchange foreach with tforeach from OhMyThreads.jl (see docs here).
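i.e. something along these lines (just a sketch - I'm guessing that `ResetArrays!` essentially `fill!`s a collection of arrays):

```julia
using OhMyThreads: tforeach

# guessed shape of ResetArrays!: zero each buffer on its own task
function ResetArrays!(arrays)
    tforeach(arrays) do a
        fill!(a, zero(eltype(a)))
    end
end
```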
Can you “pre-reduce” each chunk? I mean, can each thread already perform some of the reduction so that the main thread needs to do less? That could be more efficient.
@sgaure I am a bit unsure too; it might be a run-time dispatch that is being captured. I’ve noticed that depending on the simulation length I sometimes see exactly 0 bytes.
@abraemer yes, multi-threading did not seem to help unfortunately.
I’ve updated the way I account for the timings to see total time spent in reduction and resetting arrays:
Which is ~170 s in total, so about 1/4 of the total simulation time. It feels a bit annoying that I have to spend so much on “bookkeeping”, but each of these functions itself takes only ~1 ms per call, so it is also a bit tough to optimize. Only reducing the number of calls seems feasible.
@abraemer what do you mean by pre-reduce? I split my vector into nthreads pieces and then run a threaded reduction (+) afterwards.
Then you probably already do what I suggested. To me it sounded like the reduction step uses the main thread to combine the results from the different threads, so my suggestion was to try to redistribute the reduction to the threads as much as possible, such that the main thread needs to do less work overall.
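In code, what I had in mind was roughly this (a sketch with made-up names): parallelize over slices of the output instead of looping over the buffers on the main thread, so that no two tasks touch the same indices:

```julia
using Base.Threads

function threaded_reduce!(acc, buffers)
    n = length(acc)
    ranges = collect(Iterators.partition(1:n, cld(n, nthreads())))
    @threads for r in ranges
        for buf in buffers
            @views acc[r] .+= buf[r]   # each task sums all buffers over its own slice
        end
    end
    return acc
end
```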
@Dan, your suggestion finally made me pull myself up by the bootstraps, and together with ChatGPT I managed to whip something together in which the reduction and reset are minimized heavily, since I now output a different kind of particle interaction list. Before, I would output it from the perspective of grid cells; now, for each particle i, I write out its neighbours exactly. The benefit is clearly visible:
The neighbor looping seems to have gotten faster, the calculation of the neighbor list (indexcounter) has gotten slower, and some physics are still missing, but it seems successful enough. The improved memory access does cut time down, especially considering I whipped this out in a day while the other code I have been fine-tuning for over a month - so I think there is potential.
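For context, the per-particle variant is roughly of this shape (a sketch with hypothetical names, not the actual code): each task writes only to `acc[i]` for its own particles, so there are no per-thread buffers, no reduction, and essentially nothing to reset:

```julia
using Base.Threads

function per_particle_step!(acc, neighbors, positions)
    @threads for i in eachindex(neighbors)
        s = zero(eltype(acc))
        for j in neighbors[i]
            # placeholder for the actual pair physics; note the pair (i, j) is
            # visited again as (j, i) when particle j is processed
            s += 1 / (1 + abs(positions[i] - positions[j]))
        end
        acc[i] = s    # plain assignment: no += across tasks, so no reduction/reset
    end
    return acc
end
```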
It is not a complete win yet; smaller simulations (where resetting and reduction are not the driving cost) have gotten slower, so I will investigate more.
At the end of the day, I think the main lesson I’ve learned from asking this question is the importance of minimizing array accesses whenever you can and always using a profiler to check performance.
For my own reference (and hopefully for others in the future) I just wanted to write down my final findings. I updated my code to calculate per particle using the approach I described in the post above, now performing exactly the same physics as the cell-based neighbor list method I used previously. I see the following results:
We are ~100 seconds slower. After reflecting on it, this does not come as a huge surprise. The gain of the per-particle method is that I avoid reduction completely and mostly avoid resetting arrays. The loss is that I perform a lot of redundant calculations: since for each particle i its neighbours j are stored, the pair (i, j) shows up repeatedly. This is why the neighbor looping time has increased by 50%, and hence the slowdown.
I don’t see a trivial solution to this; if I start to consider the i and j interactions at once, I believe I will end up back at the same initial problem of having to perform reductions and resets of arrays in order to avoid @atomic operations.
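For completeness, the alternative I mean - visiting each pair once and scattering the result to both particles - would look roughly like this (a sketch; I believe Atomix.jl is one way to get atomic array updates on the CPU, but I have not benchmarked it):

```julia
using Base.Threads
using Atomix

function symmetric_step!(acc, pairs, positions)
    @threads for k in eachindex(pairs)
        i, j = pairs[k]
        f = 1 / (1 + abs(positions[i] - positions[j]))   # placeholder physics
        Atomix.@atomic acc[i] += f   # two tasks may hit the same particle,
        Atomix.@atomic acc[j] += f   # hence atomics (or buffers + reduction)
    end
    return acc
end
```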
I would not say the time was wasted though - going down this tangent, I understand better why resetting and reduction of arrays is a “necessary evil” for high-performance neighbor looping, and I have gotten some ideas on how to reduce memory bandwidth in the future.
So I will mark this as the “solution” just to finalize the topic and share my experience; perhaps someone else benefits in the future.
If someone finds a very snappy way to reset array values to zero, then I will change the solution to that.