You can run the code with: julia -O3 --check-bounds=no main.jl test80k.txt 1
Running the main function multiple times from the REPL does not help.
@code_warntype doesn’t report any issues with the compute_force function, and I’m already using SVector everywhere. I’d be grateful if you could review the code and point out any more performance improvements I might have missed. Thank you!
How long does one run of compute_force take, roughly?
Since you mutate Body etc., it might be more efficient for the vectors to be mutable and the struct Body itself immutable, but maybe you tried this? I see your commit note says you just changed from MVectors to SVectors, but was that the arrangement you had?
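For reference, the layout being suggested might look roughly like this. The field names are guesses, and plain Vectors stand in for StaticArrays’ MVector so the sketch runs without any packages:

```julia
# Hypothetical sketch: an immutable struct whose vector fields are
# themselves mutable, so updates happen in place. Field names are
# placeholders; the real Body presumably differs.
struct Body
    position::Vector{Float64}  # stand-in for MVector{3,Float64}
    force::Vector{Float64}
    mass::Float64
end

# The struct itself is never rebound; its vectors are mutated in place:
function reset_force!(b::Body)
    fill!(b.force, 0.0)
    return nothing
end

b = Body([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 5.0)
reset_force!(b)
```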
It might be worth adding some @inline annotations to see whether that helps, e.g. on add_force.
Aside: I don’t think your @inbounds annotations are doing anything, nor are most of your ::Float64 type annotations (which aren’t necessary for speed, but may help surface errors). You also shouldn’t need to include("common.jl") more than once. And I’d return nothing explicitly, although after reading the code I think compute_force does already do so.
@improbable22 compute_force really depends on the body itself and on how deep down the tree you have to traverse to compute all the interactions, so the execution time varies wildly.
And yes, I was mutating the MVectors in place. Replacing MVectors with SVectors brought down the execution time from 60 seconds to 35 seconds.
All of the ::Float64 are unnecessary I agree, I added them just so I could be sure there is no type instability. And I’ll delete other unnecessary things like inbounds, because I was just trying all kinds of optimizations.
OK, but was Body a plain struct, not mutable, at the same time? And was this mutation inside @inbounds? For example, at present @inbounds add_force(... does not apply to dx::Float64 = position[X] - body.position[X]… I believe… although @propagate_inbounds might.
Sorry this may be useless advice. Now I see that cell_mass also gets mutated.
Does whether a cell has children ever change? If not, it’s possible that making it Cell{cell_opened} would help.
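A sketch of what lifting that flag into a type parameter could look like. The field names are made up, since the real Cell definition isn’t shown:

```julia
# Hypothetical: encode "does this cell have children?" as a type
# parameter, so dispatch replaces a runtime branch. Fields are placeholders.
struct Cell{HasChildren}
    mass::Float64
    children::Vector{Cell}  # empty when HasChildren == false
end

has_children(::Cell{true})  = true
has_children(::Cell{false}) = false

# Dispatch-based traversal instead of an if/else on a Bool field:
total_mass(c::Cell{false}) = c.mass
total_mass(c::Cell{true})  = c.mass + sum(total_mass, c.children)
```

The trade-off is that the tree now holds a mix of Cell{true} and Cell{false}, so container element types become abstract; whether that nets a win is exactly the kind of thing to benchmark.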
The following likely won’t help your speed relative to C, but as a general speedup: I find that for n-body interactions, storing components in separate X, Y, Z fields is not favorable. For instance, to get the distance between two bodies you have to compute the X, Y, and Z distances in serial. If you instead held each position in an array of size 3, you could loop over the indices to compute the distances, and I think it is safe to say that Julia runs such loops quickly. I had to make this decision with my own molecular dynamics code, and all the testing I’ve done indicates that using arrays for everything, rather than separate X, Y, and Z arrays, gives greater than a 15% speedup. Just something for you to ponder.
It looks like you may have been trying this out in one of your test codes.
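Concretely, the two layouts discussed above might be sketched like this (the type names are hypothetical, and NTuples stand in for SVectors so the snippet needs no packages):

```julia
# Layout 1: separate fields, each component handled in serial.
struct BodyXYZ
    x::Float64
    y::Float64
    z::Float64
end
dist2(a::BodyXYZ, b::BodyXYZ) = (a.x - b.x)^2 + (a.y - b.y)^2 + (a.z - b.z)^2

# Layout 2: one length-3 container, one loop the compiler can unroll.
struct BodyVec
    pos::NTuple{3,Float64}  # stand-in for SVector{3,Float64}
end
function dist2(a::BodyVec, b::BodyVec)
    s = 0.0
    for i in 1:3
        s += (a.pos[i] - b.pos[i])^2
    end
    return s
end
```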
In terms of style, you can just use dist^3 and dx^2; there’s no need for dist * dist * dist, since the compiler deals with that. And ! bangs on all your mutating functions would make this process a little easier.
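You can check the power claim directly: Julia lowers small literal integer exponents through Base.literal_pow, which expands to repeated multiplication:

```julia
dist, dx = 2.5, 0.1

# x^3 with a literal exponent becomes Base.literal_pow(^, x, Val(3)),
# which is defined as x*x*x, so the results match exactly:
@assert dist^3 == dist * dist * dist
@assert dx^2 == dx * dx
@assert Base.literal_pow(^, dist, Val(3)) == dist^3
```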
Most of your @inbounds macro annotations don’t do anything. And I think all @propagate_inbounds is doing is inlining create_force(), which may or may not help.
In terms of speed improvements: are you using ProfileView.jl to find the hot spots, and BenchmarkTools.jl and @code_native to tune them? I wouldn’t do anything from here without checking that every change actually improves something worth improving.
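A minimal version of that workflow with only stdlib tools, using a stand-in function (in practice you’d install BenchmarkTools.jl and use @btime for statistically robust timings):

```julia
using InteractiveUtils  # stdlib; provides @code_native

# Stand-in hot function; substitute add_force! or whatever you're tuning.
f(x) = sqrt(sum(abs2, x))
x = (1.0, 2.0, 3.0)

f(x)               # run once first so compilation isn't timed
t = @elapsed f(x)  # crude Base-only timing; @btime f($x) is far better
@code_native f(x)  # inspect the generated assembly
```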
But I’m guessing add_force is what matters, and this version is 20% faster for me:
function add_force!(body, position, mass)
    sum = zero(mass)
    for i in 1:3
        sum += (position[i] - body.position[i])^2
    end
    f = (-G * body.mass * mass) / sqrt(sum)^3
    body.force = body.force + f * (position - body.position)
    nothing
end
As far as I understand, position and body.position are SVectors.
You are calculating position - body.position (which is already implemented efficiently in StaticArrays) twice, so you could write
function add_force!(body, position, mass)
    disp = position - body.position  # displacement vector
    sumsquared = sum(disp .^ 2)
    f = (-G * body.mass * mass) / (sumsquared ^ (3/2))
    body.force += f * disp
    nothing
end
Broadcasting and scalar multiplication are efficient for SVectors.
Your version is an order of magnitude slower… mostly that’s because of the 3/2 power. But even after fixing that, it’s still 10% slower, and very close to the original.
You could write it like this:
function add_force!(body, position, mass)
    d = position .- body.position
    sum = zero(mass)
    for i in 1:3
        sum += d[i]^2
    end
    f = (-G * body.mass * mass) / sqrt(sum)^3
    body.force = body.force + f * d
    nothing
end
But it doesn’t change performance, or even @code_native, at all relative to my version. The compiler doesn’t do the subtraction twice anyway.
When timings are that close (within 2x), you often have to look at the generated code in Julia and C to see which optimizations happen where, and (in this case) where Julia doesn’t manage as well as the C compiler.
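Concretely, that side-by-side comparison might look like this, with a placeholder kernel standing in for the hot part of add_force!:

```julia
using InteractiveUtils  # stdlib; provides @code_native

# Placeholder kernel: the distance-cubed denominator from the force calc.
g(d) = sqrt(d[1]^2 + d[2]^2 + d[3]^2)^3

# Julia side: dump the assembly generated for the concrete argument type.
@code_native debuginfo=:none g((1.0, 2.0, 3.0))

# C side, for comparison (run in a shell against your C kernel):
#   gcc -O3 -S kernel.c -o kernel.s
```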