# Speeding up force calculations and mutable structs

I put together a simple package to simulate many interacting Brownian particles. I’m trying to speed up the force calculation (for Lennard-Jones). Below is the force calculation using cell lists to find neighbors. The part I want to bring your attention to is towards the end (although the rest of the code could be relevant). If it helps, I could link the package or attach more code. I apologize if this is difficult to follow. I’m not sure what code to link here since it’s all interconnected.

function compute_pair_interaction!(lj::LennardJones; period_x::Float64 = -1.0, period_y::Float64 = -1.0)
    # (outer threaded loop over each `particle` omitted from this excerpt)
    x, y = particle.x, particle.y
    i = trunc(Int64, x / lj.cell_list.cell_spacing_x)
    j = trunc(Int64, y / lj.cell_list.cell_spacing_y)
    @inbounds for di = -1 : 1, dj = -1 : 1
        idi = mod(i + di, lj.cell_list.num_cells_x) + 1
        jdj = mod(j + dj, lj.cell_list.num_cells_y) + 1
        pid = lj.cell_list.start_pid[idi, jdj]
        while pid > 0
            neighbor = lj.cell_list.particles[pid]
            Δx = wrap_displacement(x - neighbor.x; period = period_x)
            Δy = wrap_displacement(y - neighbor.y; period = period_y)
            Δr² = Δx^2 + Δy^2

            σ² = (lj.σ < 0.0 ? particle.R + neighbor.R : lj.σ)^2
            cutoff² = (lj.cutoff < 0.0 ? LJ_CONST * σ² : lj.cutoff^2)
            if 0.0 < Δr² < cutoff²
                coef = lj_coef(lj.ϵ, σ², Δr²)
                f_x, f_y = coef * Δx, coef * Δy

                # THIS PART SEEMS TO BE SLOW
                particle.f_x += f_x
                particle.f_y += f_y
                if lj.use_newton_3rd
                    neighbor.f_x -= f_x
                    neighbor.f_y -= f_y
                end
            end
            pid = lj.cell_list.next_pid[pid]
        end
    end
end


The chunk of code

particle.f_x += f_x
particle.f_y += f_y
if lj.use_newton_3rd
    neighbor.f_x -= f_x
    neighbor.f_y -= f_y
end


seems to take up about half the computation time (using @benchmark, the whole function takes about 1 ms). particle and neighbor are instances of a custom type Particle containing all the information for a particle:

mutable struct Particle
    ptype::Symbol # particle type

    x::Float64 # x position
    y::Float64 # y position
    θ::Float64 # orientation

    f_x::Float64 # x force
    f_y::Float64 # y force
    t_θ::Float64 # torque in z

    γ_trans::Float64 # translational friction
    γ_rot::Float64 # rotational friction
    D_trans::Float64 # translational diffusivity
    D_rot::Float64 # rotational diffusivity

    active_force::Union{AbstractActiveForce, Nothing} # active force/self-propulsion
end


I did some tests by commenting out different parts of the calculation.

TEST 1: I commented out

particle.f_x += f_x
particle.f_y += f_y


TEST 2: I commented out

if lj.use_newton_3rd
    neighbor.f_x -= f_x
    neighbor.f_y -= f_y
end


TEST 3: I commented out both chunks.

Note that I have also tested this without multithreading and see a similar speed-up. Now, TEST 1 and TEST 2 both give ~1 ms with @benchmark. TEST 3, however, gives ~500 μs (half the time). It seems like the two chunks individually don't contribute much time, but collectively they do. Is this part slow because of my use of mutable structs? There are a few other places where I use a similar struct.field += pattern, and those don't seem to contribute nearly as much time. I know that most other codes store particle positions in arrays, so if it comes to it, I'll probably have to rewrite everything, but for the moment I like the organization of Particle.

Another thing that is a little strange: I am testing with lj.use_newton_3rd set to false. Logically, TEST 1 should then be equivalent to TEST 3, but it still takes ~1 ms instead of ~500 μs. It only goes down to ~500 μs if I replace if lj.use_newton_3rd with if false. I'm guessing this has to do with extra code being generated for the if lj.use_newton_3rd branch?


My guess is that the compiler realizes that if you modify neither particle nor neighbor, your function has no observable effects. The compiler is then free to remove most of your function's code.
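To illustrate the point (a hypothetical sketch, not the original code): a loop whose result is never stored anywhere has no observable effect, so the compiler may delete it entirely, and a benchmark of the gutted function ends up measuring nothing.

```julia
# Hypothetical sketch: `with_store!` has an observable effect (the write to
# `out`), so the loop must actually run. In `without_store`, nothing observes
# `s`, so the compiler may eliminate the whole loop as dead code.
function with_store!(out::Vector{Float64}, n::Int)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    out[1] = s  # observable effect: keeps the work alive
    return nothing
end

function without_store(n::Int)
    s = 0.0
    for i in 1:n
        s += sqrt(i)  # result never escapes; may be removed entirely
    end
    return nothing
end
```

Benchmarking without_store can report near-zero time, which looks like a speed-up but is really the optimizer removing the work, the same effect as commenting out both chunks in TEST 3.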


Thanks for your response! Do you mean that the compiler doesn't know the outcome of the if statement until runtime, so it must generate code for both branches? If so, is there any way around this? I guess a minimal example is the following:

using BenchmarkTools

mutable struct mystruct
    x::Float64
    y::Float64

    b::Bool

    u::Float64
    v::Float64
end

function myfunction(arr::Array{mystruct, 1})
    for a in arr
        for _ = 1 : 10
            val = cos(1.0)

            a.x += val
            a.y += val
            if a.b
                a.u -= val
                a.v -= val
            end
        end
    end
end

test = [mystruct(0.0, 0.0, false, 0.0, 0.0) for _ = 1 : 100000]
@benchmark myfunction($test)


TEST 1: comment out

a.x += val
a.y += val


TEST 2: comment out

if a.b
    a.u -= val
    a.v -= val
end


TEST 3: comment out both.

The whole function takes ~1.4 ms. TEST 1 and TEST 2 still give ~1.4 ms. TEST 3 gives ~400 μs. I think this captures the issue.

There is no real point benchmarking against a function with the assignments removed, because the compiler might just cut out everything else as well. Instead, benchmark alternate methods that actually do what you want.

But from reading it, writing to the mutable structs neighbor and particle is not the fastest way to do this. Just use a local variable neighbor_f_x etc. inside the for loop, not neighbor.f_x. Then copy it over outside the loop if you have to.
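A minimal sketch of that suggestion, with hypothetical names: accumulate in a plain local variable inside the hot loop and write to the mutable struct once at the end. Both functions compute the same thing; the second keeps the accumulator in a register instead of repeatedly storing to the heap-allocated struct.

```julia
mutable struct Body
    f::Float64
end

# Writes to the mutable struct on every iteration.
function accumulate_direct!(b::Body, vals::Vector{Float64})
    for v in vals
        b.f += v
    end
    return b.f
end

# Accumulates in a local variable and writes back once at the end.
function accumulate_local!(b::Body, vals::Vector{Float64})
    acc = 0.0
    for v in vals
        acc += v
    end
    b.f += acc
    return b.f
end
```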


Have you considered profiling?
The packages PProf.jl and ProfileView.jl may help you understand which lines/function calls take up a lot of time.


Setfield.jl will let you write immutable structs but wrap them in lenses to simulate mutability.
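Assuming this refers to Setfield.jl (it is named later in the thread), a minimal sketch: @set builds a new immutable struct with one field changed, and rebinding the variable simulates mutation.

```julia
using Setfield  # assumes Setfield.jl is installed

struct Point
    x::Float64
    y::Float64
end

p = Point(1.0, 2.0)
p = @set p.x = 3.0  # constructs a new Point(3.0, 2.0) and rebinds p
```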


Ya, I realized that that is not the proper way to test things. I tried an alternative where I created an array f_thread = [[0.0, 0.0] for _ = 1 : Threads.nthreads()] so that each thread can store the force with f_thread[Threads.threadid()][1] += f_x, and similarly for f_y. This way I can move particle.f_x += ... outside the neighbor-finding while loop. I still get pretty much the exact same performance. I also tried it without threads by using a stored tempf = [[0.0, 0.0] for _ = 1 : N_particles], updating it inside the force calculation with tempf[n][1] += f_x, and then updating particle.f_x outside the entire force calculation. This was actually slower (by about 1.5x-2x), presumably because I now have two loops over the particles: one for calculating the force and one for updating it.
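For reference, a sketch of the per-thread accumulator pattern described above (hypothetical names, not the package code). One caveat: the slots of a small per-thread array sit on adjacent cache lines, so threads can invalidate each other's caches on every write ("false sharing"), which may erase the benefit; and on newer Julia versions threadid() can change as tasks migrate, so accumulating over a chunk of indices per thread is generally safer.

```julia
using Base.Threads

# Each thread accumulates into its own slot; slots are combined at the end.
# Note: buf entries are adjacent in memory, so this can suffer from false
# sharing; padding the slots or chunking the work per thread avoids it.
function threaded_sum(data::Vector{Float64})
    buf = zeros(nthreads())
    @threads for i in eachindex(data)
        buf[threadid()] += data[i]
    end
    return sum(buf)
end
```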

I tried ProfileView.jl. I ran @profview on my minimal example. The results varied, but quite frequently there were a lot of calls into abstractinterpretation.jl and typeinfer.jl. I saw a similar thing with my compute_pair_interactions! when I used the built-in Julia profiler. Is this due to my custom structs? Here is an example. The only thing I see related to my function is at the end.

=========================================================
╎ 8 @Atom\src\eval.jl:41; macro expansion
╎  8 @Base\essentials.jl:709; invokelatest(::Any, ::Any, ::Vararg{Any,N} where N)
╎   8 @Base\essentials.jl:710; invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union...
╎    8 @Atom\src\eval.jl:113; eval(::String, ::Int64, ::String, ::String, ::Bool)
╎     8 C:\Users\Michael Wang\.julia\packages\Media\ItEPc\src\dynamic.jl:24; macro expansion
╎    ╎ 8 @Atom\src\eval.jl:116; macro expansion
╎    ╎  8 @Atom\src\repl.jl:127; hideprompt(::Atom.var"#200#205"{String,Int64,String,Bool})
╎    ╎   8 @Atom\src\eval.jl:117; #200
╎    ╎    8 @Base\logging.jl:514; with_logger
╎    ╎     8 @Base\logging.jl:408; with_logstate(::Function, ::Any)
╎    ╎    ╎ 8 @Atom\src\eval.jl:118; (::Atom.var"#201#206"{String,Int64,String,Bool})()
╎    ╎    ╎  8 @Atom\src\eval.jl:9; withpath(::Function, ::String)
╎    ╎    ╎   8 @CodeTools\src\utils.jl:30; withpath(::Atom.var"#202#207"{String,Int64,String,Bool}, ::String)
╎    ╎    ╎    8 @Atom\src\eval.jl:119; (::Atom.var"#202#207"{String,Int64,String,Bool})()
╎    ╎    ╎     8 @CodeTools\src\eval.jl:30; include_string(::Module, ::String, ::String, ::Int64)
╎    ╎    ╎    ╎   4 @Base\compiler\typeinfer.jl:601; typeinf_ext(::Core.MethodInstance, ::UInt64)
╎    ╎    ╎    ╎    4 @Base\compiler\typeinfer.jl:570; typeinf_ext(::Core.MethodInstance, ::Core.Compiler.Params)
╎    ╎    ╎    ╎     3 @Base\compiler\typeinfer.jl:12; typeinf(::Core.Compiler.InferenceState)
╎    ╎    ╎    ╎    ╎ 3 @Base\compiler\abstractinterpretation.jl:1326; typeinf_nocycle(::Core.Compiler.InferenceState)
╎    ╎    ╎    ╎    ╎  3 @Base\compiler\abstractinterpretation.jl:1270; typeinf_local(::Core.Compiler.InferenceState)
╎    ╎    ╎    ╎    ╎   3 @Base\compiler\abstractinterpretation.jl:1005; abstract_eval(::Any, ::Array{Any,1}, ::Core.Compiler.InferenceState)
╎    ╎    ╎    ╎    ╎    3 @Base\compiler\abstractinterpretation.jl:911; abstract_call(::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.Infe...
╎    ╎    ╎    ╎    ╎     3 @Base\compiler\abstractinterpretation.jl:926; abstract_call(::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.Inf...
╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\compiler\abstractinterpretation.jl:903; abstract_call_known(::Any, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core...
╎    ╎    ╎    ╎    ╎    ╎  1 @Base\compiler\typeutils.jl:46; argtypes_to_type
╎    ╎    ╎    ╎    ╎    ╎   1 @Base\compiler\utilities.jl:39; anymap(::typeof(Core.Compiler.widenconst), ::Array{Any,1})
1╎    ╎    ╎    ╎    ╎    ╎    1 @Base\compiler\typelattice.jl:206; widenconst(::Core.Compiler.Const)
╎    ╎    ╎    ╎    ╎    ╎ 2 @Base\compiler\abstractinterpretation.jl:904; abstract_call_known(::Any, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core...
╎    ╎    ╎    ╎    ╎    ╎  2 @Base\compiler\abstractinterpretation.jl:134; abstract_call_gf_by_type(::Any, ::Array{Any,1}, ::Any, ::Core.Compiler.Inference...
╎    ╎    ╎    ╎    ╎    ╎   2 @Base\compiler\abstractinterpretation.jl:266; abstract_call_method_with_const_args(::Any, ::Any, ::Array{Any,1}, ::Core.Simpl...
╎    ╎    ╎    ╎    ╎    ╎    2 @Base\compiler\typeinfer.jl:33; typeinf(::Core.Compiler.InferenceState)
╎    ╎    ╎    ╎    ╎    ╎     2 @Base\compiler\optimize.jl:174; optimize(::Core.Compiler.OptimizationState, ::Any)
╎    ╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\compiler\ssair\driver.jl:134; run_passes(::Core.CodeInfo, ::Int64, ::Core.Compiler.OptimizationState)
╎    ╎    ╎    ╎    ╎    ╎    ╎  1 @Base\compiler\ssair\driver.jl:127; slot2reg
╎    ╎    ╎    ╎    ╎    ╎    ╎   1 @Base\compiler\ssair\slot2ssa.jl:626; construct_ssa!(::Core.CodeInfo, ::Core.Compiler.IRCode, ::Core.Compiler.DomT...
╎    ╎    ╎    ╎    ╎    ╎    ╎    1 @Base\compiler\ssair\slot2ssa.jl:519; compute_live_ins(::Core.Compiler.CFG, ::Core.Compiler.SlotInfo)
╎    ╎    ╎    ╎    ╎    ╎    ╎     1 @Base\sort.jl:780; sort##kw
╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\sort.jl:780; #sort#8
╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  1 @Base\abstractarray.jl:971; copymutable
╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1 @Base\array.jl:350; copyto!
╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    1 @Base\array.jl:326; copyto!
╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     1 @Base\array.jl:335; _copyto_impl!(::Array{Tuple{Int64,Int64,Bool},1}, ::Int64, ::Array{Tuple...
╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\array.jl:293; unsafe_copyto!
╎    ╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\compiler\ssair\driver.jl:138; run_passes(::Core.CodeInfo, ::Int64, ::Core.Compiler.OptimizationState)
╎    ╎    ╎    ╎    ╎    ╎    ╎  1 @Base\compiler\ssair\inlining.jl:74; ssa_inlining_pass!
╎    ╎    ╎    ╎    ╎    ╎    ╎   1 @Base\compiler\ssair\inlining.jl:999; assemble_inline_todo!(::Core.Compiler.IRCode, ::Core.Compiler.OptimizationSt...
╎    ╎    ╎    ╎    ╎    ╎    ╎    1 @Base\compiler\ssair\inlining.jl:982; process_simple!(::Core.Compiler.IRCode, ::Int64, ::Core.Compiler.Params)
╎    ╎    ╎    ╎    ╎    ╎    ╎     1 @Base\compiler\ssair\inlining.jl:20; with_atype
1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\compiler\ssair\inlining.jl:18; Core.Compiler.Signature(::Any, ::Any, ::Any, ::Any)
╎    ╎    ╎    ╎     1 @Base\compiler\typeinfer.jl:33; typeinf(::Core.Compiler.InferenceState)
╎    ╎    ╎    ╎    ╎ 1 @Base\compiler\optimize.jl:174; optimize(::Core.Compiler.OptimizationState, ::Any)
╎    ╎    ╎    ╎    ╎  1 @Base\compiler\ssair\driver.jl:138; run_passes(::Core.CodeInfo, ::Int64, ::Core.Compiler.OptimizationState)
╎    ╎    ╎    ╎    ╎   1 @Base\compiler\ssair\inlining.jl:74; ssa_inlining_pass!
╎    ╎    ╎    ╎    ╎    1 @Base\compiler\ssair\inlining.jl:1046; assemble_inline_todo!(::Core.Compiler.IRCode, ::Core.Compiler.OptimizationState)
╎    ╎    ╎    ╎    ╎     1 @Base\compiler\ssair\inlining.jl:703; analyze_method!(::Int64, ::Core.Compiler.Signature, ::Any, ::Core.SimpleVector, :...
1╎    ╎    ╎    ╎    ╎    ╎ 1 @Base\compiler\utilities.jl:131; specialize_method
1╎  1 @SoftSquishyMatter\src\pairinteractions.jl:42; macro expansion
Total snapshots: 9


No, that doesn’t look normal. You’re seeing a profile of the Julia compiler itself, not of your code. To avoid any compilation steps showing up in profiled code, I typically do:

my_function()  # ensure everything is compiled
Profile.clear() # ensure the profiler is cleared
@profile my_function()  # actually collect profiling data


The ProfileView.jl package also makes visualizing profiler results much easier.


I did

test = [mystruct(0.0, 0.0, false, 0.0, 0.0) for _ = 1 : 100000]
myfunction(test)
Profile.clear()
@profile myfunction(test)
Profile.print()


for my minimal example above. It took a few tries to collect samples, but I finally got the following:

Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
╎ 1 @Atom\src\eval.jl:41; macro expansion
╎  1 @Base\essentials.jl:709; invokelatest(::Any, ::Any, ::Vararg{Any,N} where N)
╎   1 @Base\essentials.jl:710; invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{}...
╎    1 @Atom\src\eval.jl:113; eval(::String, ::Int64, ::String, ::String, ::Bool)
╎     1 ...s\Michael Wang\.julia\packages\Media\ItEPc\src\dynamic.jl:24; macro expansion
╎    ╎ 1 @Atom\src\eval.jl:116; macro expansion
╎    ╎  1 @Atom\src\repl.jl:127; hideprompt(::Atom.var"#200#205"{String,Int64,String,Bool})
╎    ╎   1 @Atom\src\eval.jl:117; #200
╎    ╎    1 @Base\logging.jl:514; with_logger
╎    ╎     1 @Base\logging.jl:408; with_logstate(::Function, ::Any)
╎    ╎    ╎ 1 @Atom\src\eval.jl:118; (::Atom.var"#201#206"{String,Int64,String,Bool})()
╎    ╎    ╎  1 @Atom\src\eval.jl:9; withpath(::Function, ::String)
╎    ╎    ╎   1 @CodeTools\src\utils.jl:30; withpath(::Atom.var"#202#207"{String,Int64,String,Bool}, ::String)
╎    ╎    ╎    1 @Atom\src\eval.jl:119; (::Atom.var"#202#207"{String,Int64,String,Bool})()
╎    ╎    ╎     1 @CodeTools\src\eval.jl:30; include_string(::Module, ::String, ::String, ::Int64)
╎    ╎    ╎    ╎   1 ...uments\NYU\Research\Julia Simulation Package\TEST.jl:41; myfunction(::Array{mystruct,1})
╎    ╎    ╎    ╎    1 @Base\Base.jl:34; setproperty!
Total snapshots: 1


Line 41 is a.y += val, which is the last thing executed since a.b is false. ProfileView just shows a bunch of equal-length colored lines stacked on top of each other with no other features. It seems like there isn't much of a way to speed things up outside of rewriting everything and using a different algorithm/container (e.g. storing everything in arrays).

Thanks I’ll take a look at this!

It looks like your code is too fast and profiling is only able to collect 1 sample, which in statistical terms is a really small sample and won’t give representative results. In general the profiled code should take a couple of seconds to run and produce at least a thousand samples.

If you are using Juno there is also @profiler macro.

Furthermore, I think profiling myfunction() is unlikely to be very informative in this case because the function is so simple. There’s no actual cost to getting or setting a field of a mutable struct (assuming that field has a concrete type in the struct definition) beyond the memory access itself, so the code produced by the compiler is just going to be a sequence of loads, additions, and stores. You’re not going to see a bunch of overhead from calling the setfield function or the + function because, by the time the compiler has optimized the code, those function calls will already have been inlined or eliminated.

I’m pretty confident in saying “no”. If you are frequently creating new instances of some mutable struct in a loop then yes, that can cause memory allocation and slow down your code. But just reading or writing the fields of a mutable struct is very cheap.

The only reason this might not be true is cache locality. Immutable structs in Julia can be stored inline in a Vector, meaning that adjacent particles in your array will have their data stored adjacently in memory. Mutable structs are not stored inline, so adjacent particles will not necessarily have their data stored nearby in memory. That additional indirection can cause your CPU to spend more time moving data in and out of its low-level caches.
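The layout difference can be checked directly: an immutable struct whose fields are all plain bits is an isbits type and gets stored inline in a Vector, while a mutable struct is heap-allocated and the Vector stores references to it.

```julia
struct ImmutablePoint   # isbits: stored inline in a Vector{ImmutablePoint}
    x::Float64
    y::Float64
end

mutable struct MutablePoint  # heap-allocated: a Vector{MutablePoint} holds pointers
    x::Float64
    y::Float64
end

isbitstype(ImmutablePoint)  # true
isbitstype(MutablePoint)    # false
```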

Before going down the road of thinking about cache locality, however, I would suggest first trying all of the easy things:

• Is @use_threads doing the right thing? That is, how does single-threaded performance (without any threading macro usage) compare?
• Does your function allocate memory in its inner loop?
• I used @code_warntype on compute_pair_interaction! and got no type issues. I also did the same for a simplified version where I removed all for loops and if statements (basically choosing one particle and one neighbor and computing the force instead of looping over all of them). @code_warntype didn't output anything bad. I'm not entirely sure this means my code is type-stable, because there was an incident in the past where @code_warntype came back fine but it turned out that a statement like one_of_many_structs_grouped_under_abstract_type.shared_field needed to be rewritten as ....shared_field::Float64 in order to have performant code.

• @use_threads is a macro that becomes Threads.@threads if lj.multithreaded is true; otherwise it becomes a normal for loop without threads (see post). With three threads, this gives about a 2x speed boost compared to a single thread. Note that if I replace @use_threads lj.multithreaded with Threads.@threads, the performance is nearly identical.

• There are about 17 allocations (2.7 KiB) per function call. 16 of those allocations come from the threading, regardless of whether I use @use_threads or Threads.@threads. I haven't quite found where the remaining allocation comes from, though it seems insignificant. Is it possible for @time or @btime to miss some allocations in a function?

Edit: I will try Duane_Wilson’s suggestion with Setfield and see if switching all my structs to immutable has any effect.

I think you are missing my point because I used the wrong example… In your original example you are writing to particle.f_x many times inside the loop when you don't have to. Just use a local variable and write it to particle.f_x outside the loop. I find that can improve performance sometimes.

And I hadn't realised that neighbor was being loaded from the cell list inside the loop, so you obviously need to update that inside the loop; my mistake. Setfield may help, but writing to the mutable struct may actually be faster as well; it's hard to say. The memory layout of your struct is going to matter; possibly StructsOfArrays.jl will help there.

Have you considered the possibility that the code is already reasonably close to an optimal solution?
Do you have timings from an alternative implementation?
It seems that you are mainly looping over pids. How many pids are there for a typical call?


If I understand you correctly, do you mean taking particle.f_x outside the while pid > 0 loop? If so, I tried that by using local f_x, f_y, updating those, and then setting particle.f_x, particle.f_y outside the while loop. The performance is essentially the same (I think because, for this test system, the loop only runs once; your suggestion is probably how I should do it if the cells are larger and contain more than one particle). If I define a local variable like f_x, f_y inside the loop that is threaded, is it shared by all the threads, or does each thread get its own copy?

Yes, that is a good point. For non-interacting particles, my simulation runs at around 30k steps/s, while with interactions it goes down to around 1k-2k steps/s, so I was hoping to get somewhere in between those two numbers! But I guess there might not be much more I can do.

For very short-ranged interactions like hard-ish disks in 2D (which I am focused on now), the cell sizes are chosen so that there is essentially only one particle per cell (in very rare instances two particles might somehow cram into one cell). However, in other types of simulations (e.g. different-sized particles), there may be many pids per cell. There are other types of cell lists that I can use, but so far I've only implemented the simplest.

Edit: I do not know of any completely different implementations, though I am looking. I think this approach with cell lists is the one most commonly used in other packages (e.g. HOOMD-blue). There are definitely some improvements I can make to how I organize my data.

In your original code it's inside a nested loop, writing multiple times to a mutable struct. It should be faster to do that once at the end and use local variables the rest of the time, especially since you say that is the most expensive part of the function.


Oh, my bad, I was looking at the wrong loop. Yes, I can definitely lift it outside the nested di, dj loops. Just to check: if I define f_x, f_y = 0.0, 0.0 inside the outer loop (the one that is threaded), are these local to each thread, i.e. does each thread update its own copy of f_x, f_y? Thanks!

Edit: I did a simple test and it seems to be the case (they are at least local to the loop, but I'm not sure whether threads might mess with them).
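To make the answer concrete, a small sketch (hypothetical names): a variable first assigned inside the loop body is a fresh local on every iteration, so under Threads.@threads each task works with its own copy and there is no sharing between threads.

```julia
using Base.Threads

# `acc` is first assigned inside the loop body, so each iteration (and hence
# each thread's task) gets its own independent copy.
function per_iteration_locals(n::Int)
    results = zeros(n)
    @threads for i in 1:n
        acc = 0.0
        for j in 1:i
            acc += 1.0
        end
        results[i] = acc
    end
    return results
end

per_iteration_locals(4)  # == [1.0, 2.0, 3.0, 4.0]
```

If the variable were instead assigned once before the threaded loop and only updated inside it, it would be captured by the loop body and shared across tasks, which is a data race.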