Improving Barnes-Hut n-body simulation performance

Hi.

I’m writing a Barnes-Hut simulator in Julia but I’m experiencing a slowdown compared to the C version.

For 80k bodies:

  • C - 25 seconds
  • Julia - 35 seconds

As expected, it spends 99.9% of its time in the compute_force function here: https://github.com/novoselrok/parallel-algorithms/blob/100f8a4709e3cbd575800c45aa6a6ff55baccdc8/nbody-bh/sequential-julia/cell.jl#L113

Here is the entire code: https://github.com/novoselrok/parallel-algorithms/tree/100f8a4709e3cbd575800c45aa6a6ff55baccdc8/nbody-bh/sequential-julia

Here is the 80k body test file: https://github.com/novoselrok/parallel-algorithms/blob/100f8a4709e3cbd575800c45aa6a6ff55baccdc8/nbody/data/test80k.txt.zip

You can run the code with: julia -O3 --check-bounds=no main.jl test80k.txt 1
Running the main function multiple times from the REPL does not help.

@code_warntype does not report any issues with the compute_force function, and I’m already using SVector everywhere. I’d be grateful if you could review the code and point out any further performance improvements I might have missed. Thank you!
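
(For reference, this is roughly the kind of check I mean, shown on a stripped-down stand-in rather than the real compute_force, which walks the whole tree; the DemoBody layout and the constant here are only illustrative:)

using StaticArrays, InteractiveUtils

const G = 6.674e-11   # illustrative value only

# Stand-in with the same layout as my code: a mutable Body holding
# immutable SVector fields.
mutable struct DemoBody
    position::SVector{3,Float64}
    force::SVector{3,Float64}
    mass::Float64
end

function demo_add_force!(body::DemoBody, position::SVector{3,Float64}, mass::Float64)
    d = position - body.position
    f = (-G * body.mass * mass) / sqrt(sum(abs2, d))^3
    body.force += f * d
    return nothing
end

b = DemoBody(SVector(0.0, 0.0, 0.0), SVector(0.0, 0.0, 0.0), 1.0)
# Everything should come out concretely typed (no Any/Union entries):
@code_warntype demo_add_force!(b, SVector(1.0, 1.0, 1.0), 2.0)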

(Side-note: here is the C version https://github.com/novoselrok/parallel-algorithms/tree/100f8a4709e3cbd575800c45aa6a6ff55baccdc8/nbody-bh/sequential-c)

Running on Ubuntu 18.04 with Julia 1.0.3; CPU info:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               8
Model name:          AMD Ryzen 7 2700X Eight-Core Processor
Stepping:            2
CPU MHz:             2192.703
CPU max MHz:         3700.0000
CPU min MHz:         2200.0000
BogoMIPS:            7385.27
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-15

How long does one run of compute_force take, roughly?

Since you mutate Body etc., it might be more efficient for the vectors to be mutable and the struct Body to be immutable, but maybe you tried this? I see your commit note says you just changed from MVectors to SVectors; was that the combination you had?
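
Roughly, the choice I have in mind is between these two layouts; the field names here are only a sketch, not your actual struct:

using StaticArrays

# Option A: mutable struct with immutable SVector fields;
# updates rebind the whole field.
mutable struct BodyA
    position::SVector{3,Float64}
    force::SVector{3,Float64}
end
# bodyA.force = bodyA.force + f * d

# Option B: immutable struct with mutable MVector fields;
# updates mutate the vector in place.
struct BodyB
    position::MVector{3,Float64}
    force::MVector{3,Float64}
end
# bodyB.force .+= f .* d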

It might be worth adding some @inline annotations to see if they help, e.g. on add_force.

As an aside, I don’t think your @inbounds annotations are doing anything, nor are most of your ::Float64 type annotations (they aren’t necessary for speed, though they can help surface errors). You also shouldn’t need to include("common.jl") more than once. And I’d return nothing explicitly, although on re-reading I think compute_force effectively does so already.

@improbable22 compute_force really depends on the body itself and how deep down the tree you have to traverse to compute all the interactions. So the execution time varies wildly.

And yes, I was mutating the MVectors in place. Replacing MVectors with SVectors brought down the execution time from 60 seconds to 35 seconds.

All of the ::Float64 annotations are unnecessary, I agree; I added them just to be sure there was no type instability. I’ll also delete the other unnecessary things like @inbounds, since I was just trying all kinds of optimizations.

OK, but was Body a plain struct, not mutable, at the same time? And was this mutation inside @inbounds? For example at present @inbounds add_force(... does not apply to dx::Float64 = position[X] - body.position[X]… I believe… although @propagate_inbounds might…
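
To illustrate what I mean about @inbounds not reaching into called functions, a generic example (not your code):

a = rand(3)
i = 1

getval(v, j) = v[j]

# @inbounds only elides bounds checks in the expression it wraps;
# the indexing inside getval is still checked:
@inbounds s = a[i] + getval(a, i)

# The callee has to opt in for the caller's @inbounds to propagate:
Base.@propagate_inbounds getval2(v, j) = v[j]
@inbounds s2 = a[i] + getval2(a, i)   # now the check inside getval2 is elided too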

Sorry this may be useless advice. Now I see that cell_mass also gets mutated.

Does whether a cell has children ever change? If not, it’s possible that making it Cell{cell_opened} would help.
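
Something along these lines, purely schematic (your Cell obviously has more fields, and the names here are made up):

# Encode "has children" in the type, so the traversal can dispatch on it
# at compile time instead of branching at runtime.
struct SchematicCell{HasChildren}
    children::Vector{SchematicCell}
    mass::Float64
end

has_children(::SchematicCell{true})  = true
has_children(::SchematicCell{false}) = false

leaf = SchematicCell{false}(SchematicCell[], 1.0)
has_children(leaf)   # false, decided by dispatch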

Yep, it changes when inserting the bodies.

The following likely won’t help your speed relative to C, but as a general speedup: for n-body interactions I find that storing the components in separate X, Y, Z variables is not favorable for speed. For instance, to get the distance between two bodies you have to compute the X, Y, and Z differences one by one. If you held the positions in an array of size 3, you could instead loop over the indices to compute the distances, and Julia does such loops quite quickly. Just something for you to ponder. I had to make this decision with my own molecular dynamics code, and all the testing I have done indicates that using length-3 arrays for everything, rather than separate X, Y, and Z arrays, gives more than a 15% speedup.
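
A sketch of the contrast I mean, with toy types rather than anyone’s actual code:

using StaticArrays

# Separate components: every distance term is spelled out by hand.
struct PosXYZ
    x::Float64
    y::Float64
    z::Float64
end
dist2(a::PosXYZ, b::PosXYZ) = (a.x - b.x)^2 + (a.y - b.y)^2 + (a.z - b.z)^2

# Length-3 vector: one loop (or reduction) covers all components,
# and for a fixed-size SVector the compiler unrolls it.
dist2(a::SVector{3,Float64}, b::SVector{3,Float64}) = sum(abs2, a - b)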

It looks like you may have been trying this out in one of your test codes.

In terms of style, you can just use dist^3 and dx^2; there is no need for dist * dist * dist, the compiler deals with that. And ! bangs on all your mutating functions would make reading this a little easier.

Most of your @inbounds macro annotations don’t do anything. And I think all @propagate_inbounds is doing is inlining create_force(), which may or may not help.

In terms of speed improvements, are you using ProfileView.jl to find the hot spots, and BenchmarkTools.jl and @code_native to tune them? I wouldn’t change anything from here on without checking that each change actually improves something worth improving.
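
For the per-function timing, a stand-alone pattern like this is enough; the kernel below is just a toy stand-in for add_force, since I don’t have your tree set up here:

using BenchmarkTools, StaticArrays

# Toy kernel standing in for add_force; benchmark the real one the same way.
kernel(p1::SVector{3,Float64}, p2::SVector{3,Float64}) = sum(abs2, p1 - p2)

p1 = @SVector rand(3)
p2 = @SVector rand(3)

# $-interpolate the arguments so global-variable overhead isn't measured.
@btime kernel($p1, $p2)
@benchmark kernel($p1, $p2)

# @code_native kernel(p1, p2)   # to compare the generated assembly with C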

But I’m guessing add_force is what matters, and this version is 20% faster for me:

function add_force!(body, position, mass)
    sum = zero(mass)
    for i in 1:3
        sum += (position[i] - body.position[i])^2
    end
    f = (-G * body.mass * mass) / sqrt(sum)^3
    body.force = body.force + Vec3((position .- body.position))
    nothing
end

As far as I understand, position and body.position are SVectors.
You are calculating position - body.position (which is already implemented efficiently in StaticArrays) twice, so you could write:

function add_force!(body, position, mass)
    disp = position - body.position   # displacement vector
    sumsquared = sum(disp .^ 2)

    f = (-G * body.mass * mass) / (sumsquared ^ (3/2) )
    body.force += f * disp
    nothing
end

Broadcasting and scalar multiplication are efficient for SVectors.


Your version is an order of magnitude slower… mostly because of the 3/2 power. But even after fixing that, it’s still 10% slower, and very close to the original.

You could write it like this:

function add_force!(body, position, mass)
    d = position .- body.position
    sum = zero(mass)
    for i in 1:3
        sum += d[i]^2
    end
    f = (-G * body.mass * mass) / sqrt(sum)^3
    body.force = body.force + d 
    nothing
end

But it doesn’t change performance or even @code_native at all from my version. The compiler doesn’t do the subtractions twice anyway 🙂

OK I probably needed the 3//2 power.
I don’t see why the for loop would be faster than the StaticArrays.jl version.

Could you please post the code you used for benchmarking?

Note that you forgot to multiply d by f in your version, which could account for the time difference.

You can also write dot(d, d) for the squared norm if you load the LinearAlgebra package.

It should be faster to calculate sqrt(sum) * sum instead of sqrt(sum)^3.
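
Putting those suggestions together, the corrected version would look roughly like this (G and the Body type come from the original code, as in the other snippets):

using LinearAlgebra

function add_force!(body, position, mass)
    d = position - body.position
    s = dot(d, d)                                  # squared distance
    f = (-G * body.mass * mass) / (s * sqrt(s))    # same as sqrt(s)^3, one sqrt and one multiply
    body.force += f * d
    nothing
end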

When timings are that close (within 2x) you often have to look at the generated code in Julia and C to see what optimizations happen and where (in this case) Julia doesn’t manage as well as the C compiler.


Hah you’re right.

3//2 is still really slow too btw, it doesn’t seem there’s an optimisation for that.

The original is pretty hard to beat, but your version is the winner if you use sqrt():

function add_force!(body, position, mass)
    disp = position - body.position   # displacement vector
    sumsquared = sum(disp .^ 2)
    f = (-G * body.mass * mass) / sqrt(sumsquared) ^ 3
    body.force += f * disp
    nothing
end

Hey.

I wonder how the code performs compared to C after all these optimizations.